



















# 0

# Machine-Independent Optimizations

- Dataflow Analysis and Optimizations
   Constant propagation
  - Copy propagation
  - Value numbering
- Elimination of common subexpression
- Dead code elimination
- Stength reduction
- Function/Procedure inlining













| D | Imp       | ortanc | e of Lo    | op Opti    | mizations    |
|---|-----------|--------|------------|------------|--------------|
|   | Program   | No. of | Static     | Dynamic    | <u>% of</u>  |
|   |           | Loops  | B.B. Count | B.B. Count | <u>Total</u> |
|   | nasa7     | 9      |            | 322M       | 64%          |
|   |           | 16     |            | 362M       | 72%          |
|   |           | 83     |            | 500M       | ~100%        |
|   | matrix300 | 1      | 17         | 217.6M     | 98%          |
|   |           | 15     | 96         | 221.2M     | 98+%         |
|   | tomcatv   | 1      | 7          | 26.1M      | 50%          |
|   |           | 5      | 22         | 52.4M      | 99+%         |
|   |           | 12     | 96         | 54.2M      | ~100%        |



| 0      |                             | Fu                           | unction inlining             |
|--------|-----------------------------|------------------------------|------------------------------|
| • R    | eplace fun                  | ction calls wit              | h function body              |
| • In   | crease co                   | mnilation scor               | ne (increase II P)           |
|        | a constant pr               | onparation common            |                              |
| е<br>_ | .y. constant pr             | opayation, common            | i subexpression              |
| • R    | educe fun                   | ction call over              | head                         |
| е      | .g. passing arg             | juments, reg. saves          | and restores                 |
|        | [W.M. Hwu, 19               | 91 (DEC 3100)]               |                              |
|        | Program                     | In-line Speedup              | in-line Code Expansion       |
|        | ссср                        | 1.06                         | 1.25                         |
|        | compress                    | 1.05                         | 1.00+                        |
|        | equ                         | 1.12                         | 1.21                         |
|        | espresso                    | 1.07                         | 1.09                         |
|        |                             |                              |                              |
|        | lex                         | 1.02                         | 1.06                         |
|        | lex<br>tbl                  | 1.02<br>1.04                 | 1.06<br>1.18                 |
|        | lex<br>tbl<br>xlisp         | 1.02<br>1.04<br>1.46         | 1.06<br>1.18<br>1.32         |
|        | lex<br>tbl<br>xlisp<br>yacc | 1.02<br>1.04<br>1.46<br>1.03 | 1.06<br>1.18<br>1.32<br>1.17 |

















|       | BB < 2 > |                                |
|-------|----------|--------------------------------|
| < 0 > | op 10    | op 12 op 15 op 27              |
| < 1 > | op 18    |                                |
| < 2 > |          |                                |
| < 3 > | op 14    | op 16                          |
| < 4 > |          |                                |
| < 5 > |          |                                |
| < 6 > | op 17    | After Scheduling               |
| < 7 > | op 20    | Anter Beneddling               |
| < 8 > | op 22    | (Prior to Register Allocation) |
| < 9 > |          |                                |















|                            | Register Allocation and Assignment                                                                                                              |
|----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
| • Allc<br>reg<br>wh<br>reg | cation: identifying program values (virtual<br>isters, live ranges) and program points at<br>ich values should be stored in a physical<br>ister |
| • Pro<br>reg               | gram values that are not allocated to isters are said to be <i>spilled</i>                                                                      |
| • Ass                      | ignment: identifying which physical register                                                                                                    |

 Assignment: identifying which physical register should hold an allocated value at each program point.

CS 211

C







# Instruction timing

- Not all instructions take the same amount of time.
  - Hard to get execution time data for instructions.
- Instruction execution times are not independent.

CS 211

• Execution time may depend on operand values.

# Co

# Trace-driven performance analysis

- Trace: a record of the execution path of a program.
- Trace gives execution path for performance analysis.
- A useful trace:
  - requires proper input values;
  - is large (gigabytes).
- Trace generation in H/W or S/W?

















| S      |  |  |  |
|--------|--|--|--|
|        |  |  |  |
|        |  |  |  |
|        |  |  |  |
|        |  |  |  |
| CS 211 |  |  |  |



















### **Control Flow Graphs**

 Motivation: language-independent and machineindependent representation of control flow in programs used in high-level and low-level code optimizers. The flow graph data structure lends itself to use of several important algorithms from graph theory.

CS 211

# **Control Flow Graph: Definition**

A control flow graph CFG = ( $N_c$ ;  $E_c$ ;  $T_c$ ) consists of

C

CS 211

- N<sub>c</sub>, a set of nodes. A node represents a straight-line sequence of operations with no intervening control flow i.e. a basic block.
- $E_c \subseteq N_c \ge N_c \ge Labels$ , a set of labeled edges.
- T<sub>c</sub>, a node type mapping. T<sub>c</sub>(n) identies the type of node n as one of: START, STOP, OTHER.

We assume that CFG contains a unique START node and a unique STOP node, and that for any node N in CFG, there exist directed paths from START to N and from N to STOP.







|                                | Data Dependence: Definition                                                                                 |
|--------------------------------|-------------------------------------------------------------------------------------------------------------|
| A <i>data</i><br>nodes<br>only | dependence, $S_1 \rightarrow S_2$ , exists between CFG $S_1$ and $S_2$ with respect to variable X if and if |
| 1. there<br>inter              | e exists a path <i>P</i> : $S_1 \rightarrow S_2$ in <i>CFG</i> , with no vening write to <i>X</i> , and     |
| 2. at lea                      | ast one of the following is true:                                                                           |
| <b>(a)</b> (fl                 | ow) X is written by S <sub>1</sub> and later read by S <sub>2</sub> , or                                    |
| <b>(b)</b> (a<br>or            | anti) X is read by S <sub>1</sub> and later is written by S <sub>2</sub>                                    |
| <b>(c)</b> (o                  | output) X is written by S <sub>1</sub> and later written by                                                 |







# Control Dependence Analysis

We want to capture two related ideas with control dependence analysis of a CFG:

 Node Y should be control dependent on node X if node X evaluates a predicate (conditional branch) which can control whether node Y will subsequently be executed or not. This idea is useful for determining whether node Y needs to wait for node X to complete, even though they have no data dependences.

CS 211



# Control Dependence Analysis (contd.)

2. Two nodes, Y and Z, should be identified as having identical control conditions if in every run of the program, node Y is executed if and only if node Z is executed. This idea is useful for determining whether nodes Y and Z can be made adjacent and executed concurrently, even though they may be far apart in the CFG.





- All instructions specified as part of the input must be executed.
- Allows deterministic modeling of the input.
- No "branch probabilities" to contend with; makes problem space easy to optimize using classical methods.







# The General Instruction Scheduling Problem (Contd.)

- Feasible Schedule: A specification of a start time for each instruction such that the following constraints are obeyed:
- 1. Resource: Number of instructions of a given type at any time < corresponding number of FUs.
- 2. Precedence and Latency: For each predecessor *j* of an instruction *i* in the DAG, *i* is the started only cycles after *j* finishes where k is the latency labeling the edge (*j*,*i*),
- Output: A schedule with the minimum overall completion time (makespan).

# Drawing on Deterministic Scheduling

- <u>Canonical List Scheduling Algorithm</u>:
- 1. Assign a *Rank* (priority) to each instruction (or node).
- 2. Sort and build a priority *list* of the instructions in non-decreasing order of Rank.
  - Nodes with smaller ranks occur earlier





| Preceder                                                                                                                                                           | nce Constraints                                                                                                                                                     |  |  |  |  |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|
| <ul> <li>Minimum required ordering and latency<br/>between definition and use</li> </ul>                                                                           |                                                                                                                                                                     |  |  |  |  |
| <ul> <li>Precedence graph         <ul> <li>Nodes: instructions</li> <li>Edges (a→b): a precedes b</li> <li>Edges are annotated with minimum</li> </ul> </li> </ul> | i1: l.s f2, 4(r2)<br>i2: l.s f0, 4(r5)<br>i3: fadd.s f0, f2, f0<br>i4: s.s f0, 4(r6)<br>i5: l.s f14, 8(r7)                                                          |  |  |  |  |
| $ \begin{split} w[i+k], & ip = z[i].rp + z[m+i].r \\ w[i+j].rp = c[k+1].rp^* \\ & (z[i].rp - z[m+i].rp) \\ & e[k+1].ip * \\ & (z[i].ip - z[m+i].ip) \end{split} $  | p;<br>i6: 1.s f6, 0(r2)<br>i7: 1.s f5, 0(r3)<br>i8: fsub.s f5, f6, f5<br>i9: fmul.s f4, f14, f5<br>i10: 1.s f15, 12(r7)<br>i11: 1.s f7, 4(r2)<br>i12: 1.s f8, 4(r3) |  |  |  |  |
| FFT code fragment                                                                                                                                                  | i13: fsub.s f8, f7, f8<br>i14: fmul.s f8, f15, f8<br>i15: fsub.s f8, f4, f8<br>i16: s.s f8, 0(r8)                                                                   |  |  |  |  |







# The Value of Greedy List Scheduling (Contd.)

- 1. On the first scan: *i*1 which is the first step.
- 2. On the second and third scans and out of the list order, respectively *i*4 and *i*5 to correspond to steps two and three of the schedule.
- 3. On the fourth and fifth scans, *i*2 and *i*3 respectively scheduled in steps four and five.





- Number of descendants in precedence graph
- Maximum latency from root node of precedence graph
- Length of operation latency
- Ranking of paths based on importance
- Combination of above







| S       | Scalar            | Scheduling        | Example |
|---------|-------------------|-------------------|---------|
| Cycle   | Ready list        | Schedule          | Code    |
| 1       | 1,2,4,3,5         | 1                 | ld a    |
| 2       | 1,2,4,3,5         | 1                 | ld a    |
| 3       | 2,4,3,5           | 2                 | ld b    |
| 4       | 2,4,3,5           | 2                 | ld b    |
| 5       | 4,3,5             | 4                 | a+b     |
| 6       | 3,5               | 3                 | ld c    |
| 7       | 3 <mark>,5</mark> | 3                 | ld c    |
| 8       | 5                 | 5                 | mult    |
| 9       | 5                 | 5                 | mult    |
| 10      |                   |                   |         |
| 11      |                   |                   |         |
| 12      | Ready inst are    | green             |         |
| CS 2/13 | Red indicates     | not ready         |         |
| 4.4     | Black indicate    | s under execution |         |

|       |                   |          | Resources |         |     |      |       |
|-------|-------------------|----------|-----------|---------|-----|------|-------|
| Cycle | Ready list        | Schedule | Mem       | Me<br>m | ALU |      | Code  |
| 1     | 1,2,4,3,5         | 1,2      | X         | x       |     | ld a | ld b  |
| 2     | 1,2,4,3,5         | 1,2      | x         | X       |     | ld a | ld b  |
| 3     | 4,3,5             | 4,3      | X         |         | Х   | ld c | (a+b) |
| 4     | 3, <mark>5</mark> | 3        | Х         |         |     | ld c |       |
| 5     | 5                 | 5        |           |         | Х   |      | mult  |
| 6     | 5                 | 5        |           |         | Х   |      | mult  |









A Critical Choice: The Rank Function for Prioritizing Nodes







# An Example Rank Function (Contd.)

- (b) For a node at level 1, construct a new label which is the concentration of all its successors connected by a latency 1 edge.
  - Edge i2 to i4 in this case.
- (c) The empty symbol is associated with latency zero edges.
  - Edges i3 to i4 for example.

CS 211

# An Example Rank Function (Contd.)

- (d) The result is that i2 and i3 respectively get new labels and hence ranks '= > '' =
- Note that '= > " = i.e., labels are drawn from a totally ordered alphabet.
- (e) Rank of *i*1 is the concentration of the ranks of its immediate successors *i*2 and *i*3 i.e., it is "'= '| ".
- 3. The resulting sorted list is (optimum) *i*1, cs*i*2, *i*3, *i*4.





| S  | Traces                                                                                                                                        |
|----|-----------------------------------------------------------------------------------------------------------------------------------------------|
| •  | "Trace Scheduling: A Technique for Global<br>Microcode Compaction," J.A. Fisher, <i>IEEE<br/>Transactions on Computers</i> , Vol. C-30, 1981. |
| •  | Main Ideas:                                                                                                                                   |
|    | <ul> <li>Choose a program segment that has no cyclic<br/>dependences.</li> </ul>                                                              |
|    | • Choose <i>one</i> of the paths out of each branch that is encountered.                                                                      |
|    | more                                                                                                                                          |
| CS | 211                                                                                                                                           |

































|                                                                         | Execution Tim                                                                    | e and Spill-cos                                                        | <u>it</u> |
|-------------------------------------------------------------------------|----------------------------------------------------------------------------------|------------------------------------------------------------------------|-----------|
| <ul> <li>Spilling: I<br/>register<br/>registers<br/>range ne</li> </ul> | Moving a variable<br>resident to memo<br>are available, an<br>eds to be allocate | that is currently<br>ry when no more<br>d a new live-<br>ed one spill. |           |
| • Minimizin<br>optimisti<br>the varia<br>minimizi                       | g Execution Cost:<br>c assignment— i.<br>bles are register-<br>ng spilling.      | Given an<br>.e., one where all<br>resident,                            |           |
|                                                                         |                                                                                  |                                                                        |           |
| CS 211                                                                  |                                                                                  |                                                                        |           |



heuristics for coloring.



# Register Allocation as Coloring

- Given *k* registers, interpret each register as a color.
- The graph *G* is the interference graph of the given program.
- The nodes of the interference graph are the executable live ranges on the target platform.
- A coloring of the interference graph is an assignment of registers (colors) to live ranges (nodes).
- Running out of colors implies not enough registers and hence a need to spill in the above model.

















# **Important Modeling Difference**

- A second major difference is the granularity at which code is modeled.
  - In the classical approach, individual instructions are modeled whereas
  - Now, basic blocks are the primitive units modeled as nodes in live ranges and the interference graph.
- The final major difference is the place of the register allocation in the overall compilation process.
  - In the present approach, the interference graph is considered earlier in the compilation process using intermediate level statements; compiler generated temporaries are known.
  - In contrast, in the previous work the allocation is done at the level of the machine code.

CS 211

# The Main Information to be Used by the Register Allocator

- For each live range, we have a bit vector *LIVE* of the basic blocks in it.
- Also we have *INTERFERE* which gives for the live range, the set of all other live ranges that interfere with it.
- Recall that two live ranges interfere if they intersect in at least one (basic-block).
- If *INTERFERE* is smaller than the number of available of registers for a node *i*, then *i* is *unconstrained*; it is constrained otherwise.

more...

### The Main Information to be Used by the Register Allocator

- An unconstrained node can be safely assigned a register since conflicting live ranges do not use up the available registers.
- We associate a (possibly empty) set FORBIDDEN with each live range that represents the set of colors that have already been assigned to the members of its INTERFERENCE set.

The above representation is essentially a detailed interference graph representation.

CS 211

# Prioritizing Live Ranges In the memory bound approach, given live ranges with a choice of assigning registers, we do the following: Choose a live range that is "likely" to yield greater savings in execution time. This means that we need to estimate the savings of each basic block in a live range.

|                                                                   | Estimate the Savings                                                                                                   |
|-------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------|
| Given a live range<br>savings in a basi                           | e X for variable x, the estimated c block i is determined as follows:                                                  |
| 1. First compute<br>loads and store<br>of cycles taken            | <i>CyclesSaved</i> which is the number of<br>ed of <i>x</i> in <i>i</i> scaled by the number<br>n for each load/store. |
| 2. Compensate the might be needed store the varial <i>Setup</i> . | ne single load and/or store that<br>ed to bring the variable in and/or<br>ble at the end and denote it by              |
| Note that Setup store or a load                                   | is derived from a single load or plus a store.                                                                         |
|                                                                   | more                                                                                                                   |







# The Algorithm (Contd.) 3. For each live range X' that is in INTERFERE for X (a) If the FORBIDDEN of X' is the set of all colors

i.e., if no colors are available, SPLIT (X'). Procedure SPLIT breaks a live range into smaller

do:

live ranges with the intent of reducing the interference of X' it will be described next.

4. Repeat the above steps till all constrained live ranges are colored or till there is no color left to color any basic block. CS 211



# The Idea Behind Splitting

- · Splitting ensures that we break a live range up into increasingly smaller live ranges.
- The limit is of course when we are down to the size of a single basic block.
- The intuition is that we start out with coarsegrained interference graphs with few nodes.
- This makes the interference node degree possibly high.
- We increase the problem size via splitting on a need-to basis.
- This strategy lowers the interference. •



![](_page_37_Figure_1.jpeg)

![](_page_37_Figure_2.jpeg)

![](_page_37_Figure_3.jpeg)

![](_page_38_Figure_0.jpeg)

Contrast with instruction scheduling.

• Factoring in register allocation into scheduling and solving the problem "globally" is a research <sup>CS</sup> Issue.

![](_page_38_Figure_3.jpeg)

# EPIC and Compiler Optimization

- EPIC requires dependency free "scheduled code"
- Burden of extracting parallelism falls on compiler
- success of EPIC architectures depends on efficiency of Compilers!!
- We provide overview of Compiler Optimization techniques (as they apply to EPIC/ILP)
  - enhanced by examples using Trimaran ILP Infrastructure
- CS 211

![](_page_38_Figure_11.jpeg)

# Scheduling for ILP Processors

- Size of basic block limits amount of ILP that can be extracted
- More than one basic block = going beyond branches
  - Loop optimizations also
  - Trace scheduling

    Pick a trace in the program graph
    Most frequently executed region of code
- Region based scheduling
   Find a region of code, and send this to the scheduler/register allocator

![](_page_39_Figure_0.jpeg)

![](_page_39_Figure_1.jpeg)

![](_page_39_Figure_2.jpeg)

![](_page_39_Figure_3.jpeg)

![](_page_40_Figure_0.jpeg)

![](_page_40_Figure_1.jpeg)

![](_page_40_Figure_2.jpeg)

![](_page_40_Figure_3.jpeg)

![](_page_41_Figure_0.jpeg)

![](_page_41_Figure_1.jpeg)

![](_page_41_Figure_2.jpeg)

![](_page_41_Figure_3.jpeg)

![](_page_42_Figure_0.jpeg)

![](_page_42_Figure_1.jpeg)

![](_page_42_Figure_2.jpeg)

![](_page_42_Figure_3.jpeg)

![](_page_42_Figure_4.jpeg)

![](_page_43_Figure_0.jpeg)

![](_page_43_Figure_1.jpeg)

![](_page_43_Figure_2.jpeg)

![](_page_43_Figure_3.jpeg)

![](_page_44_Figure_0.jpeg)

![](_page_44_Figure_1.jpeg)

![](_page_44_Figure_2.jpeg)

![](_page_44_Figure_3.jpeg)

![](_page_45_Figure_0.jpeg)

![](_page_45_Figure_2.jpeg)

![](_page_45_Figure_3.jpeg)

![](_page_45_Picture_4.jpeg)

![](_page_46_Figure_0.jpeg)

![](_page_46_Figure_1.jpeg)