











# MIPS Instruction Formats Summary

- Minimum number of instructions required
  - Information flow: load/store
  - Logic operations: logic and/or/not, shift
  - Arithmetic operations: addition, subtraction, etc.
  - Branch operations
- Instructions have different numbers of operands: 1, 2, 3
- 32 bits representing a single instruction
- Disassembly is simple and starts by decoding the opcode field

| Name       | Fields |        |        |        |        |        | Comments                      |
|------------|--------|--------|--------|--------|--------|--------|-------------------------------|
| Field size | 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits | All MIPS instructions 32 bits |
| R-format   | op     | rs     | rt     | rd     | shamt  | funct  | Arithmetic instruction format |
| I-format   | op     | rs     | rt     | address/immediate (16 bits) | | | Transfer, branch, imm. format |
| J-format   | op     | target address (26 bits) | | | | | Jump instruction format |
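The field layout above can be turned into a small decoder. The following is a minimal sketch, not a full disassembler: the function name `decode` is hypothetical, it relies on the standard MIPS convention that opcode 0 marks R-format and opcodes 2 and 3 (`j`, `jal`) are J-format, and it treats every other opcode as I-format, which is a simplification.

```python
def decode(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into its format fields."""
    op = (word >> 26) & 0x3F              # bits 31..26, present in every format
    if op == 0:                           # R-format (opcode 0 in standard MIPS)
        return {
            "format": "R",
            "op": op,
            "rs": (word >> 21) & 0x1F,    # 5 bits
            "rt": (word >> 16) & 0x1F,    # 5 bits
            "rd": (word >> 11) & 0x1F,    # 5 bits
            "shamt": (word >> 6) & 0x1F,  # 5 bits
            "funct": word & 0x3F,         # 6 bits
        }
    if op in (2, 3):                      # J-format: j (2) and jal (3)
        return {"format": "J", "op": op, "target": word & 0x3FFFFFF}
    # Simplifying assumption: everything else is decoded as I-format.
    return {
        "format": "I",
        "op": op,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": word & 0xFFFF,             # 16-bit address/immediate
    }


addu_fields = decode(0x00221821)          # addu $3, $1, $2
lw_fields = decode(0x8C220004)            # lw $2, 4($1)
```

Decoding starts from the opcode in every case, which is why disassembly of a fixed 32-bit format is simple.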



Figure: MIPS addressing modes

1. Immediate addressing: op, rs, rt, immediate
2. Register addressing: op, rs, rt, rd, ..., funct; the operand is in a register
3. Base addressing: op, rs, rt, address; the operand is in memory (byte, halfword, or word) at register + address
4. PC-relative addressing: op, rs, rt, address; the target is the memory word at PC + address

# MIPS Instruction Subset Core

- ADD and SUB
  - addu rd, rs, rt
  - subu rd, rs, rt
- OR Immediate:
  - ori rt, rs, imm16
- LOAD and STORE Word
  - lw rt, rs, imm16
  - sw rt, rs, imm16
- BRANCH:
  - beq rs, rt, imm16

| inst  | Register Transfers |
|-------|--------------------|
| ADDU  | R[rd] <- R[rs] + R[rt]; PC <- PC + 4 |
| SUBU  | R[rd] <- R[rs] - R[rt]; PC <- PC + 4 |
| ORi   | R[rt] <- R[rs] \| zero_ext(Imm16); PC <- PC + 4 |
| LOAD  | R[rt] <- MEM[ R[rs] + sign_ext(Imm16) ]; PC <- PC + 4 |
| STORE | MEM[ R[rs] + sign_ext(Imm16) ] <- R[rt]; PC <- PC + 4 |
| BEQ   | if ( R[rs] == R[rt] ) then PC <- PC + 4 + (sign_ext(Imm16) << 2) else PC <- PC + 4 |
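The register transfers above can be read as a tiny interpreter. This is a sketch under stated assumptions: instructions are given as already-decoded dictionaries (a hypothetical representation), memory is a Python dict keyed by byte address, and the hardwired-zero behaviour of register 0 is not modelled.

```python
MASK = 0xFFFFFFFF  # keep values within 32 bits


def sign_ext(imm16: int) -> int:
    """Interpret a 16-bit immediate as a signed value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def zero_ext(imm16: int) -> int:
    """Interpret a 16-bit immediate as an unsigned value."""
    return imm16 & 0xFFFF


def step(inst: dict, R: list, MEM: dict, pc: int) -> int:
    """Apply one register transfer and return the new PC."""
    op = inst["op"]
    rs, rt = inst.get("rs", 0), inst.get("rt", 0)
    if op == "addu":
        R[inst["rd"]] = (R[rs] + R[rt]) & MASK
    elif op == "subu":
        R[inst["rd"]] = (R[rs] - R[rt]) & MASK
    elif op == "ori":
        R[rt] = R[rs] | zero_ext(inst["imm16"])
    elif op == "lw":
        R[rt] = MEM.get((R[rs] + sign_ext(inst["imm16"])) & MASK, 0)
    elif op == "sw":
        MEM[(R[rs] + sign_ext(inst["imm16"])) & MASK] = R[rt]
    elif op == "beq":
        if R[rs] == R[rt]:
            # Taken branch: word offset is shifted left by 2 and added to PC + 4.
            return (pc + 4 + (sign_ext(inst["imm16"]) << 2)) & MASK
    else:
        raise ValueError(f"unsupported instruction: {op}")
    return (pc + 4) & MASK
```

Note that `ori` uses zero extension while `lw`, `sw`, and `beq` use sign extension, which is the reason the datapath needs an extender that supports both.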

# Step 1: Requirements of the Instruction Set

- Memory
  - instruction & data: instruction=MEM[PC]
- Registers (32 x 32)
  - read RS; read RT; Write RT or RD
- PC, what is the new PC?
  - Add 4 or extended immediate to PC
- Extender: sign-extension or 0-extension?
  - Add and Sub register or extended immediate













[Figure: PC, next-address logic, and instruction memory (address in, 32-bit instruction word out)]









































# Performance of Single-Cycle Datapath

- Time needed per instruction:
  - Variable clock cycle time datapath: R: 400 ps, lw: 600 ps, sw: 550 ps, branch: 350 ps, j: 200 ps
  - Same clock cycle time datapath: 600 ps
- Average time needed per instruction:
  - With a variable clock: 447.5 ps
  - With the same clock: 600 ps
- Performance ratio: **600/447.5 = 1.34**
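The average above is a weighted sum of the per-instruction times. The slide does not state the instruction mix, so the mix below is an assumption, chosen only because it reproduces the 447.5 ps figure; the names `LATENCY_PS`, `MIX_PERCENT`, and `average_time_ps` are illustrative.

```python
# Per-instruction latencies from the slide, in picoseconds.
LATENCY_PS = {"R": 400, "lw": 600, "sw": 550, "branch": 350, "j": 200}

# ASSUMED instruction mix in percent (not given in the slide).
MIX_PERCENT = {"R": 45, "lw": 25, "sw": 10, "branch": 15, "j": 5}


def average_time_ps(latency: dict, mix_percent: dict) -> float:
    """Weighted average instruction time for a variable-clock datapath."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("instruction mix must sum to 100 percent")
    return sum(latency[kind] * mix_percent[kind] for kind in latency) / 100


variable_clock = average_time_ps(LATENCY_PS, MIX_PERCENT)
same_clock = max(LATENCY_PS.values())   # the slowest instruction (lw) sets the clock
ratio = same_clock / variable_clock     # how much faster a variable clock would be
```

With a single fixed clock every instruction pays the `lw` time, so the ratio grows as the mix shifts toward short instructions.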



- Single-cycle datapath ensures the execution of any instruction within one clock cycle
  - Functional units must be duplicated if used multiple times by one instruction, e.g. the ALU. Why?
  - Functional units can be shared if used by different instructions
- Single-cycle datapath is not efficient in time
  - Clock cycle time is determined by the instruction taking the longest time, e.g. lw in MIPS
  - Variable clock cycle time is too complicated
  - Multiple clock cycles per instruction
  - Pipelining





- Single Cycle Datapath and Control Design
- Pipelined Datapath and Control Design

















# Why Pipeline?

- Suppose we execute 100 instructions
- Single-cycle machine
  - 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
- Multicycle machine
  - 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
- Ideal pipelined machine
  - 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
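The three totals can be checked with a few lines of arithmetic. This is a sketch using the slide's numbers; the function names are hypothetical, and the 5-stage depth is an assumption inferred from the 4-cycle drain (stages minus one).

```python
def single_cycle_ns(n_instr: int, cycle_ns: float) -> float:
    """One long cycle per instruction (CPI = 1)."""
    return n_instr * cycle_ns


def multicycle_ns(n_instr: int, cycle_ns: float, cpi: float) -> float:
    """Short cycles, but several of them per instruction on average."""
    return n_instr * cpi * cycle_ns


def pipelined_ns(n_instr: int, cycle_ns: float, stages: int) -> float:
    """Ideal pipeline: one instruction per cycle plus (stages - 1) drain cycles."""
    return (n_instr + stages - 1) * cycle_ns


single = single_cycle_ns(100, 45)        # 4500 ns
multi = multicycle_ns(100, 10, 4.6)      # about 4600 ns
piped = pipelined_ns(100, 10, stages=5)  # 1040 ns (assumed 5 stages)
```

The drain term is fixed, so its relative cost shrinks as the instruction count grows.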



# Can pipelining get us into trouble?

- Structural hazards: attempt to use the same resource two different ways at the same time
  - e.g., a combined washer/dryer would be a structural hazard, or the folder busy doing something else (watching TV)
  - A single memory causes structural hazards
- Data hazards: attempt to use an item before it is ready
  - e.g., one sock of a pair in the dryer and one in the washer; can't fold until you get the sock from the washer through the dryer
  - An instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: attempt to make a decision before the condition is evaluated
  - e.g., washing football uniforms and need to get the proper detergent level; need to see after the dryer before putting the next load in
  - Branch instructions
- Can always resolve hazards by waiting
  - Pipeline control must detect the hazard
  - Take action (or delay action) to resolve hazards


# Slow Down From Stalls

- Perfect pipelining with no hazards → an instruction completes every cycle (total cycles ~ num instructions) → speedup = increase in clock speed = num pipeline stages
- With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes
- Total cycles = number of instructions + stall cycles
- Slowdown because of stalls = 1 / (1 + stall cycles per instr)
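The stall formulas above translate directly into code. This is a minimal sketch with hypothetical function names; the example values (5 stages, 0.25 stall cycles per instruction) are illustrative and not taken from the slides.

```python
def total_cycles(n_instr: int, stall_cycles: int) -> int:
    """Total cycles = number of instructions + stall cycles."""
    return n_instr + stall_cycles


def stall_slowdown(stalls_per_instr: float) -> float:
    """Fraction of the ideal throughput that remains once stalls are counted."""
    return 1.0 / (1.0 + stalls_per_instr)


def effective_speedup(stages: int, stalls_per_instr: float) -> float:
    """Ideal speedup (number of stages) scaled down by the stall factor."""
    return stages * stall_slowdown(stalls_per_instr)


cycles = total_cycles(100, 25)          # 100 instructions, 25 stall cycles
factor = stall_slowdown(25 / 100)       # 0.25 stall cycles per instruction
speedup = effective_speedup(5, 25 / 100)
```

In this example a quarter of a stall cycle per instruction already costs one fifth of the ideal speedup.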























# A simplified pipeline speedup equation for Branch:

$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

| Scheduling scheme | Branch penalty | CPI  | Speedup vs. unpipelined | Speedup vs. stall |
|-------------------|----------------|------|-------------------------|-------------------|
| Stall pipeline    | 3              | 1.60 | 3.1                     | 1.0               |
| Predict taken     | 1              | 1.20 | 4.2                     | 1.33              |
| Predict not taken | 1              | 1.14 | 4.4                     | 1.40              |
| Delayed branch    | 0.5            | 1.10 | 4.5                     | 1.45              |
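The table entries can be reproduced from the speedup equation. The sketch below assumes a pipeline depth of 5 and, for the direct use of the formula, a branch frequency of 20%; neither value is stated in the slide, but both are consistent with the stall-pipeline row. For the other rows it uses the table's third-column values directly as the denominator, since the predict-not-taken row implies that only some branches pay the penalty.

```python
DEPTH = 5  # ASSUMED pipeline depth; consistent with the table's speedups

# Third column of the table: 1 + branch frequency x effective branch penalty.
CPI = {
    "Stall pipeline": 1.60,
    "Predict taken": 1.20,
    "Predict not taken": 1.14,
    "Delayed branch": 1.10,
}


def pipeline_speedup(depth: int, branch_freq: float, branch_penalty: float) -> float:
    """Speedup = depth / (1 + branch frequency * branch penalty)."""
    return depth / (1 + branch_freq * branch_penalty)


def speedups(scheme: str) -> tuple:
    """Return (speedup vs. unpipelined, speedup vs. stall pipeline) for a scheme."""
    vs_unpipelined = DEPTH / CPI[scheme]
    vs_stall = CPI["Stall pipeline"] / CPI[scheme]
    return vs_unpipelined, vs_stall


for scheme in CPI:
    vs_unpipelined, vs_stall = speedups(scheme)
    print(f"{scheme:18s} {vs_unpipelined:.2f} {vs_stall:.2f}")
```

The speedup relative to the stalling pipeline depends only on the ratio of the two denominators, so it does not depend on the assumed depth.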













# Software Scheduling to Avoid Load Hazards

Try producing fast code for

- `a = b + c;`
- `d = e - f;`

assuming a, b, c, d, e, and f are in memory.

| Slow code       | Fast code       |
|-----------------|-----------------|
| LW  Rb,b        | LW  Rb,b        |
| LW  Rc,c        | LW  Rc,c        |
| ADD Ra,Rb,Rc    | LW  Re,e        |
| SW  a,Ra        | ADD Ra,Rb,Rc    |
| LW  Re,e        | LW  Rf,f        |
| LW  Rf,f        | SW  a,Ra        |
| SUB Rd,Re,Rf    | SUB Rd,Re,Rf    |
| SW  d,Rd        | SW  d,Rd        |
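The benefit of the reordering can be counted mechanically. The sketch below assumes a pipeline with forwarding in which the only stall is a one-cycle load-use stall, that is, an instruction that reads a register loaded by the immediately preceding `LW`. The `Instr` type and `count_load_use_stalls` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, or None for a store
    srcs: Tuple[str, ...]    # registers read


def count_load_use_stalls(program: list) -> int:
    """Count instructions that read the result of the load right before them."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "LW" and prev.dest in cur.srcs:
            stalls += 1
    return stalls


SLOW = [
    Instr("LW", "Rb", ()), Instr("LW", "Rc", ()),
    Instr("ADD", "Ra", ("Rb", "Rc")), Instr("SW", None, ("Ra",)),
    Instr("LW", "Re", ()), Instr("LW", "Rf", ()),
    Instr("SUB", "Rd", ("Re", "Rf")), Instr("SW", None, ("Rd",)),
]

FAST = [
    Instr("LW", "Rb", ()), Instr("LW", "Rc", ()), Instr("LW", "Re", ()),
    Instr("ADD", "Ra", ("Rb", "Rc")), Instr("LW", "Rf", ()),
    Instr("SW", None, ("Ra",)), Instr("SUB", "Rd", ("Re", "Rf")),
    Instr("SW", None, ("Rd",)),
]

slow_stalls = count_load_use_stalls(SLOW)
fast_stalls = count_load_use_stalls(FAST)
```

Under these assumptions the slow sequence stalls twice (before `ADD` and before `SUB`), while the fast sequence always places an independent instruction between each load and its consumer.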





- Structural hazards if the unit is not fully pipelined (divider)
- Frequent Read-After-Write hazard stalls
- Potentially multiple writes to the register file in a cycle
- Write-After-Write hazards because of out-of-order instr completion
- Imprecise exceptions because of o-o-o instr completion

Note: Can also increase the "width" of the processor: handle multiple instructions at the same time: for example, fetch two instructions, read registers for both, execute both, etc.





# Dealing With These Effects

- Multiple writes to the register file: increase the number of ports; stall one of the writers during ID; stall one of the writers during WB (the stall will propagate)
- WAW hazards: detect the hazard during ID and stall the later instruction
- Imprecise exceptions: buffer the results if they complete early or save more pipeline state so that you can return to exactly the same state that you left at
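The WAW rule above (detect the hazard and stall the later instruction) can be sketched as a simple issue scheduler. Everything here is an illustrative assumption rather than a description of a specific machine: instructions issue in order, at most one per cycle, each is described only by its destination register and a latency, and its register write happens at issue cycle + latency.

```python
def schedule_with_waw_stalls(instrs: list) -> list:
    """Return the issue cycle of each (dest, latency) pair, stalling on WAW."""
    issue_cycles = []
    pending_write = {}   # dest register -> cycle of the latest scheduled write
    next_cycle = 0
    for dest, latency in instrs:
        issue = next_cycle
        # WAW hazard: this write would land no later than an earlier write
        # to the same register, so delay issue until it lands strictly after.
        if dest in pending_write and issue + latency <= pending_write[dest]:
            issue = pending_write[dest] - latency + 1
        pending_write[dest] = issue + latency
        issue_cycles.append(issue)
        next_cycle = issue + 1   # in-order issue: later instructions wait too
    return issue_cycles


# A long operation followed by a short one writing the same register.
with_hazard = schedule_with_waw_stalls([("F2", 7), ("F2", 1), ("F4", 1)])
without_hazard = schedule_with_waw_stalls([("F2", 7), ("F4", 1)])
```

Because issue is in order, stalling the later writer also delays every instruction behind it, which is the propagation effect noted for register-file write conflicts.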

# Summary: Pipelining

## What makes it easy

- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores; memory addresses are aligned

## What makes it hard?

- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction
- We'll talk about modern processors and what really makes it hard:
  - trying to improve performance with out-of-order execution, etc.

# Summary & Questions

- Pipelining is a fundamental concept
  - multiple steps using distinct resources
- Utilize capabilities of the Datapath by pipelined instruction processing
  - start next instruction while working on the current one
  - limited by length of longest stage (plus fill/flush)
  - detect and resolve hazards

# Questions?