# **CS 152 Computer Architecture and Engineering**

# Lecture 2 - Simple Machine Implementations, Microcode

Dr. George Michelogiannakis
EECS, University of California at Berkeley
CRD, Lawrence Berkeley National Laboratory
http://inst.eecs.berkeley.edu/~cs152

#### **Last Time in Lecture 1**

- Computer Architecture >> ISAs and RTL
  - CS152 is about interaction of hardware and software, and design of appropriate abstraction layers
- The end of the uniprocessor era
  - With simple and specialized cores due to power constraints
- Cost of software development becomes a large constraint on architecture (need compatibility)
- IBM 360 introduces notion of "family of machines" running same ISA but very different implementations
  - Six different machines released on same day (April 7, 1964)
  - "Future-proofing" for subsequent generations of machine

#### **Question of the Day**

- What purpose does microcode serve today?
  - Would we have it if designing ISAs from scratch?
- Why would we want a complex ISA?
- What do you think motivated CISC and RISC?

# Instruction Set Architecture (ISA)

- The contract between software and hardware
- Typically described by giving all the programmer-visible state (registers + memory) plus the semantics of the instructions that operate on that state
- IBM 360 was the first line of machines to separate ISA from implementation (aka *microarchitecture*)
- Many implementations possible for a given ISA
  - E.g., the Soviets built code-compatible clones of the IBM 360, as did Amdahl after he left IBM.
  - E.g. 2: today you can buy AMD or Intel processors that run the x86-64 ISA.
  - E.g. 3: many cellphones use the ARM ISA with implementations from many different companies including TI, Qualcomm, Samsung, Marvell, etc.

#### Name a Famous ISA!

- Intel's x86 was initially deployed in 1978
- Is alive and well today, though larger
- Reference manual has 3883 pages!
#### Implementations of the x86

- Hundreds of different processors implement x86
  - Not just by Intel
- Some have extensions that compilers can use if available
  - But software is still compatible if not
- More than just Intel develops x86
  - x86-64 was first specified by AMD in 2000

# ISA to Microarchitecture Mapping

- ISA often designed with particular microarchitectural style in mind, e.g.,
  - Accumulator ⇒ hardwired, unpipelined
  - CISC ⇒ microcoded
  - RISC ⇒ hardwired, pipelined
  - VLIW ⇒ fixed-latency in-order parallel pipelines
  - JVM ⇒ software interpretation
- But can be implemented with any microarchitectural style
  - Intel Ivy Bridge: hardwired pipelined CISC (x86) machine (with some microcode support)
  - Simics: software-interpreted SPARC RISC machine
  - ARM Jazelle: a hardware JVM processor
  - This lecture: a microcoded RISC-V machine

# Today, Microprogramming

- To show how to build very small processors with complex ISAs
- To help you understand where CISC\* machines came from
- Because it is still used in common machines (IBM360, x86, PowerPC)
- As a gentle introduction into machine structures
- To help understand how technology drove the move to RISC\*

\* "CISC"/"RISC" names are much newer than the style of machines they refer to.

#### **Problem Microprogramming Solves**

- Complex ISA to ease programmer and assembler's life
  - With instructions that have multiple steps
- Simple processors, such as in-order cores, to meet power constraints
  - (refer to previous lecture)
- Turn complex architecture into simple microarchitecture with programmable control
  - Can also patch microcode

#### The Idea

- An ISA (assembly) instruction is not what drives the processor's datapath directly
- Instead, instructions are broken down into FSM states
- Each state is a microinstruction and outputs control signals

#### Microarchitecture: Bus-Based Implementation of ISA

Structure: How components are connected.
**Static**

Behavior: How data moves between components. **Dynamic**

CS152, Spring 2016

#### Microcontrol Unit

Maurice Wilkes, 1954. First used in EDSAC-2, completed 1958.

Embed the control logic state table in a memory array.

#### Microcoded Microarchitecture

#### **RISC-V ISA**

- RISC design from UC Berkeley
- Realistic & complete ISA, but open & simple
- Not over-architected for a certain implementation style
- Both 32-bit and 64-bit address space variants
  - RV32 and RV64
- Easy to subset/extend for education/research
  - RV32IM, RV32IMA, RV32IMAFD, RV32G
- Techreport with RISC-V spec available on class website or riscv.org
- We'll be using 32-bit and 64-bit RISC-V this semester in lectures and labs. Similar to MIPS you saw in CS61C

#### **RV32 Processor State**

- Program counter (**pc**)
- 32x32-bit integer registers (**x0-x31**)
  - x0 always contains a 0
- 32 floating-point (FP) registers (**f0-f31**)
  - each can contain a single- or double-precision FP value (32-bit or 64-bit IEEE FP)
  - FP is an extension
- FP status register (**fsr**), used for FP rounding mode & exception reporting

[Figure: register state. Integer registers x0 / zero, x1 / ra, x2 ... x31, each XPRLEN bits wide (bits XPRLEN-1 to 0); floating-point registers f0 ... f31, each 64 bits wide (bits 63 to 0); fsr, 32 bits wide (bits 31 to 0).]

# **RISC-V Instruction Encoding**

Instruction length is encoded in the low bits of the first (lowest-addressed) 16-bit parcel; longer formats continue in the following 16-bit parcels.

| Instruction length | First 16-bit parcel |
|---|---|
| 16-bit (aa ≠ 11) | xxxxxxxxxxxxxxaa |
| 32-bit (bbb ≠ 111) | xxxxxxxxxxxbbb11 |
| 48-bit | xxxxxxxxxx011111 |
| 64-bit | xxxxxxxxx0111111 |
| (80+16\*nnnn)-bit, nnnn ≠ 1111 | xxxxxnnnn1111111 |
| Reserved for ≥320 bits | xxxxx11111111111 |

- Base instruction set (RV32) always has fixed 32-bit instructions, lowest two bits = 11<sub>2</sub>
- All branches and jumps have targets at 16-bit granularity (even in base ISA where all instructions are fixed 32 bits)
  - Fetching a 32-bit instruction from a target that is not four-byte aligned will still cause a fault

#### **Four Core RISC-V Instruction Formats**

- Aligned on a four-byte boundary in memory.
- There are variants!
- Sign bit of immediates always on bit 31 of instruction.
- Register fields never move.

# **With Variants**

Sub-fields within a cell are listed from high bit to low bit.

| Type | 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|---|---|---|---|---|---|---|
| R-type | funct7 | rs2 | rs1 | funct3 | rd | opcode |
| I-type | imm[11], imm[10:5] | imm[4:1], imm[0] | rs1 | funct3 | rd | opcode |
| S-type | imm[11], imm[10:5] | rs2 | rs1 | funct3 | imm[4:1], imm[0] | opcode |
| SB-type | imm[12], imm[10:5] | rs2 | rs1 | funct3 | imm[4:1], imm[11] | opcode |
| U-type | imm[31], imm[30:25] | imm[24:20] | imm[19:15] | imm[14:12] | rd | opcode |
| UJ-type | imm[20], imm[10:5] | imm[4:1], imm[11] | imm[19:15] | imm[14:12] | rd | opcode |

#### **Integer Computational Instructions**

- I-type
- ADDI: adds sign-extended 12-bit immediate to rs1
  - Actually, all immediates in all instructions are sign extended
- SLTI(U): set less than immediate
- Shift instructions, etc...

| 31:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|---|---|---|---|---|
| imm[11:0] | rs1 | funct3 | rd | opcode |
| 12 | 5 | 3 | 5 | 7 |
| I-immediate[11:0] | src | ADDI/SLTI[U] | dest | OP-IMM |
| I-immediate[11:0] | src | ANDI/ORI/XORI | dest | OP-IMM |

#### **Integer Computational Instructions**

- R-type
- rs1 and rs2 are the source registers; rd is the destination
- SLT, SLTU: set less than
- SRL, SLL, SRA: shift logical or arithmetic left or right

| 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|---|---|---|---|---|---|
| funct7 | rs2 | rs1 | funct3 | rd | opcode |
| 7 | 5 | 5 | 3 | 5 | 7 |
| 0000000 | src2 | src1 | ADD/SLT/SLTU | dest | OP |
| 0000000 | src2 | src1 | AND/OR/XOR | dest | OP |
| 0000000 | src2 | src1 | SLL/SRL | dest | OP |
| 0100000 | src2 | src1 | SUB/SRA | dest | OP |

#### **S-Type (SB variant)**

12-bit signed immediate split across two fields

| 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|---|---|---|---|---|---|
| imm[12], imm[10:5] | rs2 | rs1 | funct3 | imm[4:1], imm[11] | opcode |
| 1 + 6 | 5 | 5 | 3 | 4 + 1 | 7 |
| offset[12,10:5] | src2 | src1 | BEQ/BNE | offset[11,4:1] | BRANCH |
| offset[12,10:5] | src2 | src1 | BLT[U] | offset[11,4:1] | BRANCH |
| offset[12,10:5] | src2 | src1 | BGE[U] | offset[11,4:1] | BRANCH |

Branches compare two registers; the target is PC+(immediate<<1) (signed offset in multiples of two).
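The branch-offset field layout above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference decoder: the function names `sb_branch_offset` and `branch_target` are hypothetical, and the two example instruction words were hand-assembled for this sketch (`beq x0, x0, +8` and `beq x0, x0, -4`). Placing the fields at bit positions 12 down to 1 is the same as shifting the 12-bit immediate left by one.

```python
def sb_branch_offset(instr: int) -> int:
    """Reassemble the signed branch offset from an SB-type instruction word."""
    imm12 = (instr >> 31) & 0x1      # instruction bit 31    -> offset bit 12 (sign)
    imm10_5 = (instr >> 25) & 0x3F   # instruction bits 30:25 -> offset bits 10:5
    imm4_1 = (instr >> 8) & 0xF      # instruction bits 11:8  -> offset bits 4:1
    imm11 = (instr >> 7) & 0x1       # instruction bit 7      -> offset bit 11
    # Offset bit 0 is always zero, so offsets are multiples of two.
    offset = (imm12 << 12) | (imm11 << 11) | (imm10_5 << 5) | (imm4_1 << 1)
    if imm12:                        # sign-extend the 13-bit value
        offset -= 1 << 13
    return offset


def branch_target(pc: int, instr: int) -> int:
    """Target of a taken branch: PC plus the signed offset."""
    return pc + sb_branch_offset(instr)


if __name__ == "__main__":
    print(sb_branch_offset(0x00000463))           # beq x0, x0, +8
    print(sb_branch_offset(0xFE000EE3))           # beq x0, x0, -4
    print(hex(branch_target(0x100, 0xFE000EE3)))  # branch back one instruction
```

Because the sign bit always sits in instruction bit 31, the sign extension needs only that one bit, whichever immediate variant is being decoded.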
Branches do not have a delay slot.

#### **UJ-Type**

- "J": unconditional jump, PC+offset target
- "JAL": jump and link, also writes PC+4 to x1
- Offset scaled by 1-bit left shift; can jump to a 16-bit instruction boundary (same for branches)
- Also "JALR", where Imm (12 bits) + rs1 = target

#### U-Type

| 31:12 | 11:7 | 6:0 |
|---|---|---|
| imm[31:12] | rd | opcode |
| 20 | 5 | 7 |
| U-immediate[31:12] | dest | LUI |
| U-immediate[31:12] | dest | AUIPC |

Writes 20-bit immediate to top of destination register. Used to build large immediates. 12-bit immediates are signed, so the sign has to be accounted for when building 32-bit immediates in a 2-instruction sequence (LUI high-20b, ADDI low-12b).

#### **Loads and Stores**

| 31:25 | 24:20 | 19:15 | 14:12 | 11:7 | 6:0 |
|---|---|---|---|---|---|
| imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode |
| 7 | 5 | 5 | 3 | 5 | 7 |
| offset[11:5] | src | base | width | offset[4:0] | STORE |

- Store instructions (S-type). Loads (I-type).
- (rs1 + immediate) addressing
- Store only uses rs1 and rs2. Rd is only present when being written to

#### Where is NOP?

addi x0, x0, 0

# **Data Formats and Memory Addresses**

#### Data formats:

8-b bytes, 16-b half words, 32-b words and 64-b double words

#### Some issues

Suppose the memory is organized in 32-bit words. Can a word address begin only at 0, 4, 8, ....?

# **BACK TO MICROCODING**

# A Bus-based Datapath for RISC-V

Microinstruction: register to register transfer (17 control signals)

```
MA <= PC       means  RegSel = PC;  enReg = yes; ldMA = yes
B <= Reg[rs2]  means  RegSel = rs2; enReg = yes; ldB = yes
```

# **Memory Module**

Assumption: Memory operates independently and is slow as compared to Reg-to-Reg transfers (multiple CPU clock cycles per access)

#### **Instruction Execution**

#### Execution of a RISC-V instruction involves:

1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back to register file (optional)

\+ the computation of the next instruction address

# **Microprogram Fragments**

| Fragment | Micro-operations |
|---|---|
| instr fetch (can be treated as a macro) | MA, A <= PC<br>PC <= A + 4<br>IR <= Memory<br>dispatch on Opcode |
| ALU | A <= Reg[rs1]<br>B <= Reg[rs2]<br>Reg[rd] <= func(A,B)<br>do instruction fetch |
| ALUi | A <= Reg[rs1]<br>B <= Imm (sign extension)<br>Reg[rd] <= Opcode(A,B)<br>do instruction fetch |

# Microprogram Fragments (cont.)

| Fragment | Micro-operations |
|---|---|
| LW | A <= Reg[rs1]<br>B <= Imm<br>MA <= A + B<br>Reg[rd] <= Memory<br>do instruction fetch |
| J (JAL with rd=x0) | A <= A - 4 (get original PC back in A)<br>B <= IR<br>PC <= JumpTarg(A,B)<br>do instruction fetch |
| beq | A <= Reg[rs1]<br>B <= Reg[rs2]<br>If A==B then go to bz-taken<br>do instruction fetch |
| bz-taken | A <= PC<br>A <= A - 4 (get original PC back in A)<br>B <= BImm << 1<br>PC <= A + B<br>do instruction fetch |

JumpTarg(A,B) = {A + (B[31:7] << 1)}

BImm = IR[31:27,16:10]

#### RISC-V Microcontroller: first attempt

pure ROM implementation

# Microprogram in the ROM worksheet

| State | Op | zero? | busy | Control points | next-state |
|---|---|---|---|---|---|
| fetch<sub>0</sub> | * | * | * | MA,A <= PC | fetch<sub>1</sub> |
| fetch<sub>1</sub> | * | * | yes | | fetch<sub>1</sub> |
| fetch<sub>1</sub> | * | * | no | IR <= Memory | fetch<sub>2</sub> |
| fetch<sub>2</sub> | * | * | * | PC <= A + 4 | ? |
| | | | | | |
| fetch<sub>2</sub> | ALU | * | * | PC <= A + 4 | ALU<sub>0</sub> |
| | | | | | |
| ALU<sub>0</sub> | * | * | * | A <= Reg[rs1] | ALU<sub>1</sub> |
| ALU<sub>1</sub> | * | * | * | B <= Reg[rs2] | ALU<sub>2</sub> |
| ALU<sub>2</sub> | * | * | * | Reg[rd] <= func(A,B) | fetch<sub>0</sub> |

# **Microprogram in the ROM**

| State | Op | zero? | busy | Control points | next-state |
|---|---|---|---|---|---|
| fetch<sub>0</sub> | * | * | * | MA,A <= PC | fetch<sub>1</sub> |
| fetch<sub>1</sub> | * | * | yes | | fetch<sub>1</sub> |
| fetch<sub>1</sub> | * | * | no | IR <= Memory | fetch<sub>2</sub> |
| fetch<sub>2</sub> | ALU | * | * | PC <= A + 4 | ALU<sub>0</sub> |
| fetch<sub>2</sub> | ALUi | * | * | PC <= A + 4 | ALUi<sub>0</sub> |
| fetch<sub>2</sub> | LW | * | * | PC <= A + 4 | LW<sub>0</sub> |
| fetch<sub>2</sub> | SW | * | * | PC <= A + 4 | SW<sub>0</sub> |
| fetch<sub>2</sub> | J | * | * | PC <= A + 4 | J<sub>0</sub> |
| fetch<sub>2</sub> | JAL | * | * | PC <= A + 4 | JAL<sub>0</sub> |
| fetch<sub>2</sub> | JR | * | * | PC <= A + 4 | JR<sub>0</sub> |
| fetch<sub>2</sub> | JALR | * | * | PC <= A + 4 | JALR<sub>0</sub> |
| fetch<sub>2</sub> | beq | * | * | PC <= A + 4 | beq<sub>0</sub> |
| ... | | | | | |
| ALU<sub>0</sub> | * | * | * | A <= Reg[rs1] | ALU<sub>1</sub> |
| ALU<sub>1</sub> | * | * | * | B <= Reg[rs2] | ALU<sub>2</sub> |
| ALU<sub>2</sub> | * | * | * | Reg[rd] <= func(A,B) | fetch<sub>0</sub> |

# Microprogram in the ROM cont.

| State | Op | zero? | busy | Control points | next-state |
|---|---|---|---|---|---|
| ALUi<sub>0</sub> | * | * | * | A <= Reg[rs1] | ALUi<sub>1</sub> |
| ALUi<sub>1</sub> | * | * | * | B <= Imm | ALUi<sub>2</sub> |
| ALUi<sub>2</sub> | * | * | * | Reg[rd] <= Op(A,B) | fetch<sub>0</sub> |
| ... | | | | | |
| J<sub>0</sub> | * | * | * | A <= A - 4 | J<sub>1</sub> |
| J<sub>1</sub> | * | * | * | B <= IR | J<sub>2</sub> |
| J<sub>2</sub> | * | * | * | PC <= JumpTarg(A,B) | fetch<sub>0</sub> |
| ... | | | | | |
| beq<sub>0</sub> | * | * | * | A <= Reg[rs1] | beq<sub>1</sub> |
| beq<sub>1</sub> | * | * | * | B <= Reg[rs2] | beq<sub>2</sub> |
| beq<sub>2</sub> | * | yes | * | A <= PC | beq<sub>3</sub> |
| beq<sub>2</sub> | * | no | * | | fetch<sub>0</sub> |
| beq<sub>3</sub> | * | * | * | A <= A - 4 | beq<sub>4</sub> |
| beq<sub>4</sub> | * | * | * | B <= BImm << 1 | beq<sub>5</sub> |
| beq<sub>5</sub> | * | * | * | PC <= A+B | fetch<sub>0</sub> |
| ... | | | | | |

#### **Size of Control Store**

The ROM is addressed by $w$ input bits (opcode and status) plus $s$ state (μPC) bits, and each entry holds $c$ control-signal bits plus $s$ next-state bits, so its size is $2^{(w+s)} \times (c+s)$ bits.

*RISC-V:* $w = 5+2$, $c = 17$, $s = ?$

- no. of steps per opcode = 4 to 6 + fetch-sequence
- no. of states ≈ (4 steps per op-group) × op-groups + common sequences = 4 × 8 + 10 states = 42 states ⇒ $s = 6$
- Control ROM = $2^{(7+6)} \times 23$ bits ≈ 24 Kbytes

### **Reducing Control Store Size**

Control store has to be *fast ⇒ expensive*

- Reduce the ROM height (= address bits)
  - reduce inputs by extra external logic: each input bit doubles the size of the control store
  - reduce states by grouping opcodes: find common sequences of actions
  - condense input status bits: combine all exceptions into one, i.e., exception/no-exception
- Reduce the ROM width
  - restrict the next-state encoding: Next, Dispatch on opcode, Wait for memory, ...
  - encode control signals (vertical microcode)

#### **RISC-V Controller V2**

## **Jump Logic**

μPCSrc = *Case* μJumpTypes

| μJumpType | μPCSrc |
|---|---|
| next | μPC+1 |
| spin | if (busy) then μPC else μPC+1 |
| fetch | absolute |
| dispatch | op-group |
| ftrue | if (zero) then absolute else μPC+1 |
| ffalse | if (zero) then μPC+1 else absolute |

#### Instruction Fetch & ALU: RISC-V-Controller-2

| State | Control points | next-state |
|---|---|---|
| fetch<sub>0</sub> | MA,A <= PC | next |
| fetch<sub>1</sub> | IR <= Memory | spin |
| fetch<sub>2</sub> | PC <= A + 4 | dispatch |
| ALU<sub>0</sub> | A <= Reg[rs1] | next |
| ALU<sub>1</sub> | B <= Reg[rs2] | next |
| ALU<sub>2</sub> | Reg[rd] <= func(A,B) | fetch |
| ALUi<sub>0</sub> | A <= Reg[rs1] | next |
| ALUi<sub>1</sub> | B <= Imm | next |
| ALUi<sub>2</sub> | Reg[rd] <= Op(A,B) | fetch |

#### Load & Store: RISC-V-Controller-2

| State | Control points | next-state |
|---|---|---|
| LW<sub>0</sub> | A <= Reg[rs1] | next |
| LW<sub>1</sub> | B <= Imm | next |
| LW<sub>2</sub> | MA <= A+B | next |
| LW<sub>3</sub> | Reg[rd] <= Memory | spin |
| LW<sub>4</sub> | | fetch |
| SW<sub>0</sub> | A <= Reg[rs1] | next |
| SW<sub>1</sub> | B <= BImm | next |
| SW<sub>2</sub> | MA <= A+B | next |
| SW<sub>3</sub> | Memory <= Reg[rs2] | spin |
| SW<sub>4</sub> | | fetch |

## **Branches:** RISC-V-Controller-2

| State | Control points | next-state |
|---|---|---|
| beq<sub>0</sub> | A <= Reg[rs1] | next |
| beq<sub>1</sub> | B <= Reg[rs2] | next |
| beq<sub>2</sub> | A <= PC | ffalse |
| beq<sub>3</sub> | A <= A - 4 | next |
| beq<sub>4</sub> | B <= BImm<<1 | next |
| beq<sub>5</sub> | PC <= A+B | fetch |

## Jumps: RISC-V-Controller-2

| State | Control points | next-state |
|---|---|---|
| JALR<sub>0</sub> | A <= Reg[rs1] | next |
| JALR<sub>1</sub> | Reg[1] <= A | next |
| JALR<sub>2</sub> | PC <= A | fetch |
| JAL<sub>0</sub> | A <= PC | next |
| JAL<sub>1</sub> | Reg[1] <= A | next |
| JAL<sub>2</sub> | A <= A-4 | next |
| JAL<sub>3</sub> | B <= IR | next |
| JAL<sub>4</sub> | PC <= JumpTarg(A,B) | fetch |

J and JR are special cases with rd = x0

#### VAX 11-780 Microcode

Excerpt from a VAX-11/780 microcode listing (MICRO2, 26-May-81; PCS 01, FPLA 0D, WCS122), file CALL2.MIC, "Procedure call: CALLG, CALLS". Only the legible micro-operations and their comments are shown. In the listing each microinstruction also carries its control-store address and a control word printed as six 4-digit hex groups, e.g. `U 11F4, 0811,2035,0180,F910,0000,0CD8`.

```
CALL.7: D_Q.AND.RC[T2],           ;STRIP MASK TO BITS 11-0
        CALL,J/MPUSH              ;PUSH REGISTERS
                                  ;RETURN FROM MPUSH
        CACHE_D[LONG],            ;PUSH PC
        LAB_R[SP]                 ; BY SP
CALL.8: R[SP]&VA_LA-K[.8]         ;UPDATE SP FOR PUSH OF PC &
        D_R[FP]                   ;READY TO PUSH FRAME POINTER
                                  ;CALL SITE FOR PSHSP
        CACHE_D[LONG],            ;STORE FP,
        LAB_R[SP],                ; GET SP AGAIN
        SC_K[.FFF0],              ;-16 TO SC
        CALL,J/PSHSP
        D_R[AP],                  ;READY TO PUSH AP
        Q_ID[PSL]                 ;AND GET PSW FOR COMBINATIO
        CACHE_D[LONG],            ;STORE OLD AP
        Q_Q_ANDNOT.K[.1F],        ;CLEAR PSW<T,N,Z,V,C>
        LAB_R[SP]                 ;GET SP INTO LATCHES AGAIN
        PC&VA_RC[T1],FLUSH.IB     ;LOAD NEW PC AND CLEAR OUT
        D_DAL.SC,                 ;PSW TO D<31:16>
        Q_RC[T2],                 ;RECOVER MASK
        SC_SC+K[.3],              ;PUT -13 IN SC
        LOAD.IB,PC_PC+1           ;START FETCHING SUBROUTINE I
        D_DAL.SC,                 ;MASK AND PSW IN D<31:03>
        Q_PC[T4],                 ;GET LOW BITS OF OLD SP TO Q<1:0>
        SC_SC+K[.A]               ;PUT -3 IN SC
```

## **Implementing Complex Instructions**

| Operation | Kind |
|---|---|
| rd <= M[(rs1)] op (rs2) | Reg-Memory-src ALU op |
| M[(rd)] <= (rs1) op (rs2) | Reg-Memory-dst ALU op |
| M[(rd)] <= M[(rs1)] op M[(rs2)] | Mem-Mem ALU op |

#### **Mem-Mem ALU Instructions:** RISC-V-Controller-2

```
M[(rd)] <= M[(rs1)] op M[(rs2)]      Mem-Mem ALU op

ALUMM0   MA <= Reg[rs1]          next
ALUMM1   A <= Memory             spin
ALUMM2   MA <= Reg[rs2]          next
ALUMM3   B <= Memory             spin
ALUMM4   MA <= Reg[rd]           next
ALUMM5   Memory <= func(A,B)     spin
ALUMM6                           fetch
```

Complex instructions usually do not require datapath modifications in a microprogrammed implementation -- only extra space for the control program

Implementing these instructions using a hardwired controller is difficult without datapath modifications

#### **Performance Issues**

Microprogrammed control ⇒ multiple cycles per instruction

Cycle time?
$$t_C > \max(t_{reg\text{-}reg},\ t_{ALU},\ t_{\mu ROM})$$

Suppose $10 \times t_{\mu ROM} < t_{RAM}$

Good performance, relative to a single-cycle hardwired implementation, can be achieved even with a CPI of 10

# Horizontal vs Vertical μCode

- Horizontal μcode has wider μinstructions
  - Multiple parallel operations per μinstruction
  - Fewer microcode steps per macroinstruction
  - Sparser encoding ⇒ more bits
- Vertical μcode has narrower μinstructions
  - Typically a single datapath operation per μinstruction
    - separate μinstruction for branches
  - More microcode steps per macroinstruction
  - More compact ⇒ fewer bits
- Nanocoding
  - Tries to combine best of horizontal and vertical μcode

## **Nanocoding**

- MC68000 had 17-bit μcode containing either 10-bit μjump or 9-bit nanoinstruction pointer
  - Nanoinstructions were 68 bits wide, decoded to give 196 control signals

### Microprogramming thrived in the Seventies

- Significantly faster ROMs than DRAMs were available
- For complex instruction sets, datapath and controller were cheaper and simpler
- New instructions, e.g., floating point, could be supported without datapath modifications
- Fixing bugs in the controller was easier
- ISA compatibility across various models could be achieved easily and cheaply

Except for the cheapest and fastest machines, all computers were microprogrammed

## **Writable Control Store (WCS)**

- Implement control store in RAM not ROM
  - MOS SRAM memories now almost as fast as control store (core memories/DRAMs were 2-10x slower)
  - Bug-free microprograms difficult to write
- User-WCS provided as option on several minicomputers
  - Allowed users to change microcode for each processor
- User-WCS failed
  - Little or no programming tools support
  - Difficult to fit software into small space
  - Microcode control tailored to original ISA, less useful for others
  - Large WCS part of processor state: expensive context switches
  - Protection difficult if user can change microcode
  - Virtual memory required restartable microcode

## Microprogramming is far from extinct

- Played a crucial role in micros of the Eighties
  - DEC uVAX, Motorola 68K series, Intel 286/386
- Plays an assisting role in most modern micros
  - e.g., AMD Bulldozer, Intel Ivy Bridge, Intel Atom, IBM PowerPC, ...
  - Most instructions executed directly, i.e., with hard-wired control
  - Infrequently-used and/or complicated instructions invoke microcode
- Patchable microcode common for post-fabrication bug fixes, e.g. Intel processors load μcode patches at bootup
  - Intel released microcode updates in 2014 and 2015

### **Question of the Day**

- What purpose does microcode serve today?
  - Would we have it if designing ISAs from scratch?
- Why would we want a complex ISA?
- What do you think motivated CISC and RISC?
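As a recap of the Controller V2 jump logic described in this lecture, the next-μPC selection can be sketched in Python. This is a minimal illustration under stated assumptions, not the course's implementation: the names `next_upc` and `trace`, the μPC addresses (fetch at 0 to 2, ALU at 3 to 5), and the `busy_cycles` memory model are all hypothetical choices made for the sketch.

```python
FETCH0 = 0  # assumed control-store address of the fetch0 microinstruction

def next_upc(upc, jump_type, busy=False, zero=False,
             absolute=FETCH0, op_group=None):
    """Select the next micro-PC from the microinstruction's jump type."""
    if jump_type == "next":
        return upc + 1
    if jump_type == "spin":       # wait on memory: stay put while busy
        return upc if busy else upc + 1
    if jump_type == "fetch":      # jump to an absolute address (fetch0 here)
        return absolute
    if jump_type == "dispatch":   # jump to the entry point of the opcode group
        return op_group
    if jump_type == "ftrue":
        return absolute if zero else upc + 1
    if jump_type == "ffalse":
        return upc + 1 if zero else absolute
    raise ValueError(f"unknown jump type: {jump_type}")

# (control points, jump type) pairs for fetch and ALU, laid out at assumed addresses
ROM = [
    ("MA,A <= PC", "next"),             # 0: fetch0
    ("IR <= Memory", "spin"),           # 1: fetch1
    ("PC <= A + 4", "dispatch"),        # 2: fetch2
    ("A <= Reg[rs1]", "next"),          # 3: ALU0
    ("B <= Reg[rs2]", "next"),          # 4: ALU1
    ("Reg[rd] <= func(A,B)", "fetch"),  # 5: ALU2
]

def trace(rom, dispatch_table, opcode, busy_cycles, steps):
    """Return the sequence of micro-PC values visited over `steps` cycles."""
    upc, visited = FETCH0, []
    for _ in range(steps):
        _, jump = rom[upc]
        # Assumed memory model: the first `busy_cycles` spin cycles see busy.
        busy = jump == "spin" and busy_cycles > 0
        if busy:
            busy_cycles -= 1
        visited.append(upc)
        upc = next_upc(upc, jump, busy=busy,
                       op_group=dispatch_table.get(opcode))
    return visited

if __name__ == "__main__":
    print(trace(ROM, {"ALU": 3}, "ALU", busy_cycles=2, steps=8))
```

Running the sketch for an ALU instruction with memory busy for two cycles shows the controller spinning at fetch1 before dispatching, which is the behavior the `spin` jump type replaces the per-state `busy` input of the first ROM design with.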