
### CS 211: Computer Architecture

## Introduction to ILP Processors & Concepts

### **Course Outline**

- Introduction: Trends, Performance models
- Review of computer organization and ISA implementation
- Overview of Pipelining
- ILP Processors: Superscalar Processors – Next! ILP Intro and Superscalar
- ILP: EPIC/VLIW Processors
- Compiler optimization techniques for ILP processors – getting max performance out of ILP design


- Part 2: Other components: memory, I/O
















| Functional Unit                              | Operations Performed      | Latency |
|----------------------------------------------|---------------------------|---------|
| Integer Unit 1                               | Integer ALU Operations    | 1       |
|                                              | Integer Multiplication    | 2       |
|                                              | Loads                     | 2       |
|                                              | Stores                    | 1       |
| Integer Unit 2 / Branch Unit                 | Integer ALU Operations    | 1       |
|                                              | Integer Multiplication    | 2       |
|                                              | Loads                     | 2       |
|                                              | Stores                    | 1       |
|                                              | Test-and-branch           | 1       |
| Floating-point Unit 1, Floating-point Unit 2 | Floating Point Operations | 3       |





















### **Sequential Architecture**

- Program contains no explicit information regarding dependencies that exist between instructions
- Dependencies between instructions must be determined by the hardware
  - It is only necessary to determine dependencies with sequentially preceding instructions that have been issued but not yet completed
- Compiler may re-order instructions to facilitate the hardware's task of extracting parallelism







- Dataflow processors are representative of Dependence architectures
  - Execute instruction at earliest possible time subject to availability of input operands and functional units
  - Dependencies communicated by providing with each instruction a list of all successor instructions
  - As soon as all input operands of an instruction are available, the hardware fetches the instruction
  - The instruction is executed as soon as a functional unit is available
- Few Dataflow processors currently exist
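
The firing rule described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `dataflow_schedule` and the instruction names are hypothetical, functional units are assumed unlimited, and every instruction takes one step.

```python
def dataflow_schedule(instrs, initial_values):
    """Group instructions into steps; an instruction fires once all inputs exist.

    instrs maps an instruction name to (list of input values, output value).
    Assumes unlimited functional units and unit latency (illustrative only).
    """
    available = set(initial_values)
    pending = dict(instrs)
    steps = []
    while pending:
        # Every instruction whose operands are all available fires in this step.
        ready = sorted(name for name, (inputs, _) in pending.items()
                       if set(inputs) <= available)
        if not ready:
            raise ValueError("remaining instructions can never fire")
        steps.append(ready)
        for name in ready:
            available.add(pending.pop(name)[1])
    return steps


# Hypothetical three-instruction program: i3 consumes the results of i1 and i2.
program = {
    "i1": (["a", "b"], "t1"),
    "i2": (["c", "d"], "t2"),
    "i3": (["t1", "t2"], "t3"),
}
steps = dataflow_schedule(program, ["a", "b", "c", "d"])
print(steps)  # [['i1', 'i2'], ['i3']]
```

Here i1 and i2 fire together because neither waits on the other, while i3 must wait for both results.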















































### **Hardware Features to Support ILP**

- Speculative Execution
  - Expensive in hardware
  - Alternative is to perform speculative code motion at compile time
    - Move operations from subsequent blocks up past branch operations into preceding blocks
  - Requires less demanding hardware
    - A mechanism to ensure that exceptions caused by speculatively scheduled operations are reported if and only if flow of control is such that they would have been executed in the nonspeculative version of the code
    - Additional registers to hold the speculative execution state







### Recall our old friend from Review of pipelining

- CPI = ideal CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
  - Ideal (pipeline) CPI: measure of the maximum performance attainable by the implementation
  - <u>Structural hazards</u>: HW cannot support this combination of instructions
  - <u>Data hazards</u>: Instruction depends on result of prior instruction still in the pipeline
  - <u>Control hazards</u>: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
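
As a quick worked instance of the CPI equation, the sketch below adds per-instruction stall averages to the ideal CPI. The stall values are made-up numbers chosen only for illustration, and `effective_cpi` is a hypothetical helper name.

```python
def effective_cpi(ideal_cpi, structural_stalls, data_stalls, control_stalls):
    """CPI = ideal CPI + structural + data hazard + control stalls.

    Each stall argument is the average number of stall cycles per instruction.
    """
    return ideal_cpi + structural_stalls + data_stalls + control_stalls


# Hypothetical averages (not measured data): stall cycles per instruction.
cpi = effective_cpi(ideal_cpi=1.0,
                    structural_stalls=0.05,
                    data_stalls=0.20,
                    control_stalls=0.15)
print(f"effective CPI = {cpi:.2f}")  # effective CPI = 1.40
```

With these assumed numbers the pipeline runs 40% slower than its ideal CPI of 1, which is the gap that ILP techniques try to close.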







### Name Dependence #1: Anti-dependence

- Name dependence: two instructions use the same register or memory location, called a name, but no data flows between the instructions through that name. There are two versions of name dependence.
- Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> reads it

Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1"

- If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard


### Name Dependence #2: Output dependence

- Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> writes it
  - I: `sub r1, r4, r3`
  - J: `add r1, r2, r3`
  - K: `mul r6, r1, r7`

Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1".

- If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
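
The three kinds of dependence between an earlier and a later instruction can be checked mechanically. The sketch below is a minimal illustration, assuming each instruction is written as a `(dest, src1, src2)` tuple of register names; `classify_dependences` is a hypothetical helper, and the anti-dependence pair at the end is an assumed example, not taken from the slides.

```python
def classify_dependences(first, second):
    """Return the dependences of `second` on the earlier instruction `first`.

    Each instruction is a tuple (dest, src1, src2) of register names.
    """
    dest1, *srcs1 = first
    dest2, *srcs2 = second
    found = []
    if dest1 in srcs2:
        found.append("RAW")  # true data dependence: second reads what first wrote
    if dest2 in srcs1:
        found.append("WAR")  # anti-dependence: second overwrites what first reads
    if dest1 == dest2:
        found.append("WAW")  # output dependence: both write the same name
    return found


# I, J, K from the output-dependence example: sub r1,r4,r3 / add r1,r2,r3 / mul r6,r1,r7
I = ("r1", "r4", "r3")
J = ("r1", "r2", "r3")
K = ("r6", "r1", "r7")
print(classify_dependences(I, J))  # ['WAW']
print(classify_dependences(J, K))  # ['RAW']

# Assumed anti-dependence pair: the first reads r1, the second then writes r1.
print(classify_dependences(("r4", "r1", "r3"), ("r1", "r2", "r3")))  # ['WAR']
```

Whether any of these dependences becomes an actual hazard still depends on the pipeline; the check only reports what must be preserved.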











### **Data Flow**

- Data flow: actual flow of data values among instructions that produce results and those that consume them
  - branches make flow dynamic, determine which instruction is supplier of data


- Example:
  - `DADDU R1, R2, R3`
  - `BEQZ R4, L`
  - `DSUBU R1, R5, R6`
  - `L: ...`
  - `OR R7, R1, R8`
- OR depends on DADDU or DSUBU? Must preserve data flow on execution

### **Ok.. ILP through Software/Compiler**

- Ask what you (SW/compiler) can do for the HW?
- Quick look at one SW technique to
  - Decrease CPU time
  - Expose more ILP

### **Loop Unrolling: A Simple S/W Technique**

- Parallelism within one "basic block" is minimal
  - Need to look at larger regions of code to schedule
- Loops are very common
  - Number of iterations, same tasks in each iteration
- Simple observation: if iterations are independent, then multiple iterations can be executed in parallel
- Loop unrolling: unrolling multiple iterations of a loop to create more instructions to schedule

### Example FP Loop: Where are the Hazards?

| Instruction         | Comment                   |
|---------------------|---------------------------|
| `Loop: LD F0,0(R1)` | F0=vector element         |
| `ADDD F4,F0,F2`     | add scalar from F2        |
| `SD 0(R1),F4`       | store result              |
| `SUBI R1,R1,8`      | decrement pointer 8B (DW) |
| `BNEZ R1,Loop`      | branch R1!=zero           |

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Load double                  | Store double             | 0                       |
| Integer op                   | Integer op               | 0                       |

- Where are the stalls?

### FP Loop Hazards

| Instruction         | Comment                   |
|---------------------|---------------------------|
| `Loop: LD F0,0(R1)` | F0=vector element         |
| `ADDD F4,F0,F2`     | add scalar in F2          |
| `SD 0(R1),F4`       | store result              |
| `SUBI R1,R1,8`      | decrement pointer 8B (DW) |
| `BNEZ R1,Loop`      | branch R1!=zero           |
| `NOP`               | delayed branch slot       |

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |
| Load double                  | Store double             | 0                       |
| Integer op                   | Integer op               | 0                       |

### FP Loop Showing Stalls

| Line | Instruction         | Comment                   |
|------|---------------------|---------------------------|
| 1    | `Loop: LD F0,0(R1)` | F0=vector element         |
| 2    | `stall`             |                           |
| 3    | `ADDD F4,F0,F2`     | add scalar in F2          |
| 4    | `stall`             |                           |
| 5    | `stall`             |                           |
| 6    | `SD 0(R1),F4`       | store result              |
| 7    | `SUBI R1,R1,8`      | decrement pointer 8B (DW) |
| 8    | `BNEZ R1,Loop`      | branch R1!=zero           |
| 9    | `stall`             | delayed branch slot       |

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |

- 9 clocks: Rewrite code to minimize stalls?

### Revised FP Loop Minimizing Stalls

| Line | Instruction         | Comment                     |
|------|---------------------|-----------------------------|
| 1    | `Loop: LD F0,0(R1)` |                             |
| 2    | `stall`             |                             |
| 3    | `ADDD F4,F0,F2`     |                             |
| 4    | `SUBI R1,R1,8`      |                             |
| 5    | `BNEZ R1,Loop`      | delayed branch              |
| 6    | `SD 8(R1),F4`       | altered when move past SUBI |

Swap BNEZ and SD by changing address of SD

| Instruction producing result | Instruction using result | Latency in clock cycles |
|------------------------------|--------------------------|-------------------------|
| FP ALU op                    | Another FP ALU op        | 3                       |
| FP ALU op                    | Store double             | 2                       |
| Load double                  | FP ALU op                | 1                       |

- 6 clocks: Unroll loop 4 times to make code faster?
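
The clock counts for the two loop versions can be reproduced with a small in-order, single-issue timing model. This is a sketch under stated assumptions, not the course's simulator: `loop_cycles`, `LATENCY` and `KIND` are hypothetical names, producer/consumer pairs missing from the latency table are assumed to need 0 stall cycles, and an unfilled branch delay slot is charged one cycle.

```python
# Stall cycles required between a producer and a consumer (from the latency table).
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}
KIND = {"LD": "load", "ADDD": "fp_alu", "SD": "store", "SUBI": "int", "BNEZ": "int"}


def loop_cycles(instrs):
    """Clock cycles for one pass through a loop body.

    instrs is a list of (op, dest or None, [source registers]).
    Pairs not in LATENCY are assumed to need no stall (an assumption).
    """
    last_write = {}  # register -> (issue cycle, kind of the writing instruction)
    cycle = 0
    for op, dest, srcs in instrs:
        earliest = cycle + 1
        for reg in srcs:
            if reg in last_write:
                issued, kind = last_write[reg]
                stalls = LATENCY.get((kind, KIND[op]), 0)
                earliest = max(earliest, issued + stalls + 1)
        cycle = earliest
        if dest is not None:
            last_write[dest] = (cycle, KIND[op])
    if instrs[-1][0] == "BNEZ":
        cycle += 1  # unfilled delayed-branch slot costs one cycle
    return cycle


original = [
    ("LD", "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD", None, ["F4", "R1"]),
    ("SUBI", "R1", ["R1"]),
    ("BNEZ", None, ["R1"]),
]
revised = [
    ("LD", "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SUBI", "R1", ["R1"]),
    ("BNEZ", None, ["R1"]),
    ("SD", None, ["F4", "R1"]),  # fills the delay slot
]
print(loop_cycles(original))  # 9
print(loop_cycles(revised))   # 6
```

The model only tracks register dependences within one pass through the body, which is enough to match the 9-clock and 6-clock counts on the slides.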

### Unroll Loop Four Times (straightforward way)

| Line | Instruction         | Comment          |
|------|---------------------|------------------|
| 1    | `Loop: LD F0,0(R1)` |                  |
| 2    | `ADDD F4,F0,F2`     |                  |
| 3    | `SD 0(R1),F4`       | drop SUBI & BNEZ |
| 4    | `LD F6,-8(R1)`      |                  |
| 5    | `ADDD F8,F6,F2`     |                  |
| 6    | `SD -8(R1),F8`      | drop SUBI & BNEZ |
| 7    | `LD F10,-16(R1)`    |                  |
| 8    | `ADDD F12,F10,F2`   |                  |
| 9    | `SD -16(R1),F12`    | drop SUBI & BNEZ |
| 10   | `LD F14,-24(R1)`    |                  |
| 11   | `ADDD F16,F14,F2`   |                  |
| 12   | `SD -24(R1),F16`    |                  |
| 13   | `SUBI R1,R1,#32`    | alter to 4*8     |
| 14   | `BNEZ R1,LOOP`      |                  |
| 15   | `NOP`               |                  |

- Rewrite loop to minimize stalls?

### **Unrolled Loop That Minimizes Stalls**

| Line | Instruction         | Comment     |
|------|---------------------|-------------|
| 1    | `Loop: LD F0,0(R1)` |             |
| 2    | `LD F6,-8(R1)`      |             |
| 3    | `LD F10,-16(R1)`    |             |
| 4    | `LD F14,-24(R1)`    |             |
| 5    | `ADDD F4,F0,F2`     |             |
| 6    | `ADDD F8,F6,F2`     |             |
| 7    | `ADDD F12,F10,F2`   |             |
| 8    | `ADDD F16,F14,F2`   |             |
| 9    | `SD 0(R1),F4`       |             |
| 10   | `SD -8(R1),F8`      |             |
| 11   | `SD -16(R1),F12`    |             |
| 12   | `SUBI R1,R1,#32`    |             |
| 13   | `BNEZ R1,LOOP`      |             |
| 14   | `SD 8(R1),F16`      | 8-32 = -24  |

- 14 clock cycles, or 3.5 per iteration
- What assumptions made when moved code?
  - OK to move store past SUBI even though changes register
  - OK to move loads before stores: get right data?
  - When is it safe for compiler to do such changes?
- When safe to move instructions?
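
The same transformation can be seen at source level. The sketch below, in Python for readability, adds a scalar to every element of an array in a rolled loop and in a version unrolled four times with the loads, adds and stores grouped as in the scheduled code. The function names are hypothetical, and the cleanup loop for lengths that are not a multiple of 4 is an addition the slides do not cover.

```python
def add_scalar_rolled(x, s):
    """One element per iteration: load, add, store."""
    out = list(x)
    for i in range(len(out)):
        out[i] = out[i] + s
    return out


def add_scalar_unrolled4(x, s):
    """Four elements per iteration, with loads, adds and stores grouped."""
    out = list(x)
    n = len(out)
    i = 0
    while i + 4 <= n:
        # Four independent loads, each into its own name (like F0, F6, F10, F14).
        a, b, c, d = out[i], out[i + 1], out[i + 2], out[i + 3]
        # Four independent adds.
        a, b, c, d = a + s, b + s, c + s, d + s
        # Four stores, then a single index update for all four elements.
        out[i], out[i + 1], out[i + 2], out[i + 3] = a, b, c, d
        i += 4
    while i < n:  # cleanup iterations (assumed; not shown on the slides)
        out[i] = out[i] + s
        i += 1
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
assert add_scalar_unrolled4(data, 2.5) == add_scalar_rolled(data, 2.5)
```

Both versions compute the same result only because the iterations are independent; that is the property the compiler has to establish before reordering.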

### Compiler Perspectives on Code Movement

- Definitions: compiler is concerned about dependencies in the program; whether a dependence becomes a HW hazard depends on the given pipeline
  - Try to schedule to avoid hazards
- (True) Data dependencies (RAW if a hazard for HW)
  - Instruction i produces a result used by instruction j, or
  - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
- If dependent, can't execute in parallel
- Easy to determine for registers (fixed names)
- Hard for memory:
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?


### Where are the data dependencies?

| Line | Instruction         | Comment                     |
|------|---------------------|-----------------------------|
| 1    | `Loop: LD F0,0(R1)` |                             |
| 2    | `ADDD F4,F0,F2`     |                             |
| 3    | `SUBI R1,R1,8`      |                             |
| 4    | `BNEZ R1,Loop`      | delayed branch              |
| 5    | `SD 8(R1),F4`       | altered when move past SUBI |

| Line | Instruction         | Comment          |
|------|---------------------|------------------|
| 1    | `Loop: LD F0,0(R1)` |                  |
| 2    | `ADDD F4,F0,F2`     |                  |
| 3    | `SD 0(R1),F4`       | drop SUBI & BNEZ |
| 4    | `LD F0,-8(R1)`      |                  |
| 5    | `ADDD F4,F0,F2`     |                  |
| 6    | `SD -8(R1),F4`      | drop SUBI & BNEZ |
| 7    | `LD F0,-16(R1)`     |                  |
| 8    | `ADDD F4,F0,F2`     |                  |
| 9    | `SD -16(R1),F4`     | drop SUBI & BNEZ |
| 10   | `LD F0,-24(R1)`     |                  |
| 11   | `ADDD F4,F0,F2`     |                  |
| 12   | `SD -24(R1),F4`     |                  |
| 13   | `SUBI R1,R1,#32`    | alter to 4*8     |
| 14   | `BNEZ R1,LOOP`      |                  |
| 15   | `NOP`               |                  |

### Where are the name dependencies?

| Line | Instruction         | Comment          |
|------|---------------------|------------------|
| 1    | `Loop: LD F0,0(R1)` |                  |
| 2    | `ADDD F4,F0,F2`     |                  |
| 3    | `SD 0(R1),F4`       | drop SUBI & BNEZ |
| 4    | `LD F6,-8(R1)`      |                  |
| 5    | `ADDD F8,F6,F2`     |                  |
| 6    | `SD -8(R1),F8`      | drop SUBI & BNEZ |
| 7    | `LD F10,-16(R1)`    |                  |
| 8    | `ADDD F12,F10,F2`   |                  |
| 9    | `SD -16(R1),F12`    | drop SUBI & BNEZ |
| 10   | `LD F14,-24(R1)`    |                  |
| 11   | `ADDD F16,F14,F2`   |                  |
| 12   | `SD -24(R1),F16`    |                  |
| 13   | `SUBI R1,R1,#32`    | alter to 4*8     |
| 14   | `BNEZ R1,LOOP`      |                  |
| 15   | `NOP`               |                  |

- Called "register renaming"
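
Register renaming can be sketched as a single pass that gives every write a fresh register and points later reads at the newest name. The code below is a simplified illustration: `rename_registers` and the `T0, T1, ...` names are hypothetical, memory offsets are omitted, and an unlimited supply of fresh registers is assumed.

```python
from itertools import count


def rename_registers(instrs, fresh_names):
    """Give each register write a fresh name; reads follow the latest name.

    instrs is a list of (op, dest or None, [source registers]).
    Registers that are only read (such as R1 and F2 here) keep their names.
    """
    current = {}  # original register name -> most recent fresh name
    renamed = []
    for op, dest, srcs in instrs:
        new_srcs = [current.get(reg, reg) for reg in srcs]
        new_dest = dest
        if dest is not None:
            new_dest = next(fresh_names)
            current[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Two copies of the loop body that both reuse F0 and F4 (offsets omitted).
body = [
    ("LD", "F0", ["R1"]),
    ("ADDD", "F4", ["F0", "F2"]),
    ("SD", None, ["F4", "R1"]),
]
renamed = rename_registers(body * 2, (f"T{i}" for i in count()))
for instr in renamed:
    print(instr)
```

After renaming, the second copy writes T2 and T3 instead of reusing F0 and F4, so only the true data dependences inside each copy remain.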

### Compiler Perspectives on Code Movement

- Again, name dependencies are hard for memory accesses
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
- Our example required compiler to know that if R1 doesn't change then:

```
0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
```

There were no dependencies between some loads and stores, so they could be moved past each other.
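
A conservative version of this address check is easy to write down. The sketch below assumes 8-byte (double-word) accesses and that the base register is not modified between the two references; `may_overlap` is a hypothetical helper, not a real compiler API.

```python
def may_overlap(ref_a, ref_b, size=8):
    """Return True if two memory references might touch the same bytes.

    Each reference is (base_register, offset). Assumes `size`-byte accesses
    and that the base register is unchanged between the two references.
    """
    base_a, off_a = ref_a
    base_b, off_b = ref_b
    if base_a != base_b:
        # Different base registers: values unknown at compile time, so be conservative.
        return True
    return abs(off_a - off_b) < size


print(may_overlap(("R1", 0), ("R1", -8)))    # False: distinct double words
print(may_overlap(("R4", 100), ("R6", 20)))  # True: cannot be ruled out
```

Only the same-base case can be proven independent here, which is why the unrolled loop's accesses through R1 can be reordered while 100(R4) versus 20(R6) cannot.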

### Compiler Perspectives on Code Movement

- Final kind of dependence called control dependence
- Example

if p1 {S1;};

if p2 {S2;};

S1 is control dependent on p1 and S2 is control dependent on p2 but not on p1.


### Compiler Perspectives on Code Movement

- Another kind of dependence called name dependence: two instructions use same name (register or memory location) but don't exchange data
- Antidependence (WAR if a hazard for HW)
  - Instruction j writes a register or memory location that instruction i reads from and instruction i is executed first
- Output dependence (WAW if a hazard for HW)
  - Instruction i and instruction j write the same register or memory location; ordering between instructions must be preserved.

### Compiler Perspectives on Code Movement

- Two (obvious) constraints on control dependences:
  - An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch.
  - An instruction that is not control dependent on a branch cannot be moved to after the branch so that its execution is controlled by the branch.
- Control dependencies relaxed to get parallelism; get same effect if preserve order of exceptions (address in register checked by branch before use) and data flow (value in register depends on branch)
  - Can "violate" the two constraints above by placing some 'checks' in place?
    - Branch prediction, speculation


### Where are the control dependencies?

| Line | Instruction         |
|------|---------------------|
| 1    | `Loop: LD F0,0(R1)` |
| 2    | `ADDD F4,F0,F2`     |
| 3    | `SD 0(R1),F4`       |
| 4    | `SUBI R1,R1,8`      |
| 5    | `BEQZ R1,exit`      |
| 6    | `LD F0,0(R1)`       |
| 7    | `ADDD F4,F0,F2`     |
| 8    | `SD 0(R1),F4`       |
| 9    | `SUBI R1,R1,8`      |
| 10   | `BEQZ R1,exit`      |
| 11   | `LD F0,0(R1)`       |
| 12   | `ADDD F4,F0,F2`     |
| 13   | `SD 0(R1),F4`       |
| 14   | `SUBI R1,R1,8`      |
| 15   | `BEQZ R1,exit`      |

### When Safe to Unroll Loop?

- Example: Where are data dependencies? (A, B, C distinct & nonoverlapping)

        for (i=1; i<=100; i=i+1) {
            A[i+1] = A[i] + C[i];    /* S1 */
            B[i+1] = B[i] + A[i+1];  /* S2 */
        }

1. S2 uses the value, A[i+1], computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a "loop-carried dependence": between iterations

- Implies that iterations are dependent, and can't be executed in parallel
- Not the case for our prior example; each iteration was distinct
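
One informal way to see a loop-carried dependence is to run the iterations in a different order and compare results. The sketch below does this for statement S1 (`A[i+1] = A[i] + C[i]`) and for an independent loop like the earlier add-scalar example. The function names, array contents and the small `n` are assumptions chosen for illustration, and an order test like this can suggest independence but does not prove it.

```python
def run_dependent(order, n):
    """S1 from the example: iteration i reads A[i], written by iteration i-1."""
    A = [0] * (n + 2)
    C = list(range(n + 2))
    for i in order:
        A[i + 1] = A[i] + C[i]
    return A


def run_independent(order, n):
    """Each iteration touches only its own element, as in the add-scalar loop."""
    x = list(range(n + 2))
    for i in order:
        x[i] = x[i] + 10
    return x


n = 5
forward = list(range(1, n + 1))
backward = forward[::-1]

# Reordering changes the answer when a dependence is carried between iterations.
print(run_dependent(forward, n) == run_dependent(backward, n))      # False
# Reordering is harmless when every iteration is distinct.
print(run_independent(forward, n) == run_independent(backward, n))  # True
```

The independent loop is the kind that can be unrolled and its iterations overlapped freely; the dependent one forces each iteration to wait for the previous result.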



### Next... Superscalar Processor Design

- How to deal with instruction flow
  - Dynamic branch prediction
- How to deal with register/data flow
  - Register renaming
- Dynamic branch prediction
- Dynamic scheduling using Tomasulo method