







#### VLIW and EPIC

- VLIW architectures progressed to EPIC
- A quick look at the "pure" VLIW approach



























- VLIW Architectures achieve high performance through the combination of a number of key enabling *hardware* and *software* technologies.
  - Optimizing Schedulers (compilers)
  - Static Branch Prediction
  - Symbolic Memory Disambiguation
  - Predicated Execution
  - (Software) Speculative Execution
  - Program Compression

CS 211

### Strengths of VLIW Technology

- Parallelism can be exploited at the instruction level
  - Available in both vectorizable and sequential programs.
- Hardware is regular and straightforward
  - Most hardware is in the datapath performing useful computations.
- Instruction issue costs scale approximately linearly
  - Potentially very high clock rate
- Architecture is "Compiler Friendly"
  - Implementation is completely exposed: 0 layers of interpretation
  - Compile-time information is easily propagated to run time.
- Exceptions and interrupts are easily managed
- Run-time behavior is highly predictable
  - Allows real-time applications.
  - Greater potential for code optimization.

### Weaknesses of VLIW Technology

- No object code compatibility between generations
- Program size is large (explicit NOPs)
  - Multiflow machines predated "dynamic memory compression" by encoding NOPs in the instruction memory
- Compilers are extremely complex
  - Assembly code is almost impossible
- Philosophically incompatible with caching techniques
- VLIW memory systems can be very complex
  - Simple memory systems may provide very low performance
  - Program controlled multi-layer, multi-banked memory
- Parallelism is underutilized for some algorithms.
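
To make the "explicit NOPs" point concrete, here is a minimal sketch of how a fixed-width instruction word wastes slots. The 4-wide machine, the `pack` and `nop_fraction` helpers, and the three-cycle schedule are all invented for illustration; they do not describe any particular VLIW.

```python
# Hypothetical 4-wide VLIW: every instruction word has exactly WIDTH slots.
WIDTH = 4


def pack(schedule):
    """Pad each cycle's operations with explicit NOPs up to WIDTH slots."""
    words = []
    for ops in schedule:
        if len(ops) > WIDTH:
            raise ValueError("more operations than issue slots")
        words.append(list(ops) + ["nop"] * (WIDTH - len(ops)))
    return words


def nop_fraction(words):
    """Fraction of all encoded slots that hold a NOP."""
    slots = [op for word in words for op in word]
    return slots.count("nop") / len(slots)


# Made-up schedule: one list of independent operations per cycle.
schedule = [
    ["load r1", "load r2"],
    ["add r3"],
    ["mul r4", "sub r5", "store r3"],
]

words = pack(schedule)
for word in words:
    print(word)
print("NOP fraction:", nop_fraction(words))
```

In this toy schedule half of the encoded slots are NOPs, which is the code-size cost that NOP compression schemes aim to recover.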

| Attributes                                 | Superscalar            | VLIW         |
|--------------------------------------------|------------------------|--------------|
| Multiple instructions/cycle                | yes                    | yes          |
| Multiple operations/instruction            | no                     | yes          |
| Instruction stream parsing                 | yes                    | no           |
| Run-time analysis of register dependencies | yes                    | no           |
| Run-time analysis of memory dependencies   | maybe                  | occasionally |
| Runtime instruction reordering             | maybe (Resv. Stations) | no           |
| Runtime register allocation                | maybe                  | maybe        |









- The first implementation of the IA-64 ISA





# EPIC: Key Concepts

- Speculation
- Predication (and parallel compares)
- Large (Rotating) Register Files







#### EPIC Concepts: Predication

- Branching is generally bad because it interferes with the ideal pipeline model of fetching instructions while earlier instructions are executing
- Ideally, if we eliminate branches, this problem disappears
- Predication is the process by which branches are eliminated
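
As a rough illustration of the idea, the sketch below contrasts a branching computation with an if-converted one. It is a Python model, not IA-64 code: the `commit` helper stands in for a guarded (predicated) register write, and the function names are invented for this example.

```python
def branchy_abs_diff(a, b):
    """Original control flow: one of two arms is chosen by a branch."""
    if a > b:
        return a - b
    else:
        return b - a


def commit(pred, new, old):
    """Model a guarded write: the result is kept only if the predicate is true."""
    return new if pred else old


def predicated_abs_diff(a, b):
    """If-converted form: a compare sets two predicates, both arms are issued."""
    p1 = a > b          # compare produces a predicate ...
    p2 = not p1         # ... and its complement
    x = 0
    x = commit(p1, a - b, x)   # (p1) x = a - b
    x = commit(p2, b - a, x)   # (p2) x = b - a
    return x


print(predicated_abs_diff(7, 3), predicated_abs_diff(3, 7))
```

Both arms are computed every time, and only the arm whose predicate is true updates `x`. The branch is gone, at the price of executing operations whose results are thrown away.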



































































#### Cool Features of IA64

- Predicated execution
- Speculative, non-faulting Load instruction
- Software-assisted branch prediction
- Register stack
- Rotating register frame
- Software-assisted memory hierarchy

Mostly adapted from mechanisms that had existed for VLIWs
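
The speculative, non-faulting load can be sketched as follows. This is a simplified Python model, assuming a dictionary as memory and a sentinel object in place of the deferred-exception (NaT) marker; `ld_s`, `ld`, and `guarded_sum` are illustrative names loosely modelled on the IA-64 `ld.s` / `chk.s` pairing, not real instructions or APIs.

```python
NAT = object()  # stand-in for a register's deferred-exception (NaT) marker


def ld_s(memory, addr):
    """Speculative load: never faults; a bad address yields NAT instead."""
    return memory.get(addr, NAT)


def ld(memory, addr):
    """Ordinary load: faults on a bad address."""
    if addr not in memory:
        raise MemoryError(f"fault at address {addr}")
    return memory[addr]


def guarded_sum(memory, ptr, use_ptr):
    # The load has been hoisted above the test of use_ptr (compiler speculation).
    v = ld_s(memory, ptr)
    if not use_ptr:
        return 0              # speculated value is simply discarded
    if v is NAT:              # check step: a deferred fault was recorded
        v = ld(memory, ptr)   # recovery: redo the load non-speculatively
    return v + 1


print(guarded_sum({8: 41}, 8, True))   # load was useful
print(guarded_sum({}, 8, False))       # bad address, but the value is never used
```

The fault is raised only if the program actually needs the value, so the compiler can move the load early without changing which exceptions the program can observe.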


#### Itanium Specifics

- 6-wide, 10-stage pipeline
- Fetch 2 bundles per cycle, with the help of BP, into an 8-bundle deep fetch queue
  - 512-entry 2-level BPT, 64-entry BTAC, 4 TAR, and an RSB
- Issue up to 2 bundles per cycle
  - some mixes of 6 instructions, e.g. (MFI,MFI) or (MIB,MIB<sub>b</sub>)
- Can issue as little as one syllable per cycle on a RAW hazard interlock or structural hazard (scoreboard for RAW detection)
- 8R-6W 128-entry Int. GPR, 128 82-bit FPR, 64 predicate regs
- 4 globally-bypassed single-cycle integer ALUs with MMX, 2 FMACs, 2 LSUs, 3 BUs
- Can execute IA-32 software directly
- Intended for high-end servers and workstations
- You can buy one now, finally.


## EPIC and Compiler Optimization

- EPIC requires dependency-free "scheduled code"
- The burden of extracting parallelism falls on the compiler
- The success of EPIC architectures depends on the efficiency of compilers!
- We provide an overview of compiler optimization techniques (as they apply to EPIC/ILP)
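
A minimal sketch of what "dependency-free scheduled code" asks of the compiler: split a straight-line sequence into groups whose members can issue together. The instruction format `(dest, sources)`, the register names, and the `issue_groups` function are assumptions made for this example. It checks only RAW and WAW register dependences and keeps program order, so a real scheduler that reorders instructions could do better.

```python
def issue_groups(instrs):
    """Split a sequence into groups with no RAW or WAW register
    dependence inside any group (program order is preserved)."""
    groups, current, written = [], [], set()
    for dest, srcs in instrs:
        # Start a new group if this instruction reads or rewrites a
        # register produced earlier in the current group.
        if dest in written or any(s in written for s in srcs):
            groups.append(current)
            current, written = [], set()
        current.append((dest, srcs))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


# Hypothetical instruction stream: (destination, source registers)
code = [
    ("r1", ("r10",)),       # r1 = load [r10]
    ("r2", ("r11",)),       # r2 = load [r11]
    ("r3", ("r1", "r2")),   # r3 = r1 + r2
    ("r4", ("r12",)),       # r4 = r12 * 2
    ("r5", ("r3", "r4")),   # r5 = r3 - r4
]

for n, group in enumerate(issue_groups(code)):
    print(n, [dest for dest, _ in group])
```

The five instructions fall into three groups here. Reordering `r4`'s computation into the first group would not reduce the group count in this case, but it shows the kind of freedom a scheduling compiler exploits.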

