

### Course Objectives: Where are we?

CS 135















### Memory Hierarchies

- Key Principles
  - Locality: most programs do not access code or data uniformly
  - Smaller hardware is faster
- Goal
  - Design a memory hierarchy "with cost almost as low as the cheapest level of the hierarchy and speed almost as fast as the fastest level"
    - This implies that we must be clever about keeping more likely used data as "close" to the CPU as possible
- Levels provide subsets
  - Anything (data) found in a particular level is also found in the next level below.
  - Each level maps from a slower, larger memory to a smaller but faster memory.

























### Cache Design Questions

- Q1: Where can a block be placed in the upper level?
  - Block placement
- Q2: How is a block found if it is in the upper level?
- Q3: Which block should be replaced on a miss?
  - Block replacement
- Q4: What happens on a write?
  - Write strategy

### <u>Where can a block be placed in a cache?</u>

- 3 schemes for block placement in a cache:
  - Direct mapped cache
    - Block (or data to be cached) can go to only one place in the cache
    - Usually: (Block address) MOD (# of blocks in the cache)
  - Fully associative cache
    - Block can be placed anywhere in the cache
  - Set associative cache
    - Set = a group of blocks in the cache
    - Block mapped onto a set & then block can be placed anywhere within that set
    - Usually: (Block address) MOD (# of sets in the cache)
    - If n blocks in a set, we call it n-way set associative
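The three placement schemes can be viewed as one rule with a different number of ways per set. The sketch below is only an illustration: the function name `candidate_frames` and the 8-block cache are assumptions made for the example, not part of the slides.

```python
def candidate_frames(block_addr: int, num_blocks: int, ways: int) -> list[int]:
    """Return the cache frames where a memory block may be placed.

    ways == 1           -> direct mapped (exactly one frame)
    ways == num_blocks  -> fully associative (any frame)
    otherwise           -> n-way set associative (any frame of one set)
    """
    num_sets = num_blocks // ways
    set_index = block_addr % num_sets          # (Block address) MOD (# of sets)
    first = set_index * ways                   # frames of a set are contiguous here
    return list(range(first, first + ways))


# Hypothetical 8-block cache, memory block address 12
print(candidate_frames(12, 8, 1))  # direct mapped: [4]
print(candidate_frames(12, 8, 2))  # 2-way set associative: [0, 1]
print(candidate_frames(12, 8, 8))  # fully associative: [0, 1, 2, 3, 4, 5, 6, 7]
```

With one way per set the number of sets equals the number of blocks, which is why the direct mapped formula is the same MOD computation.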



























### Which block should be replaced on a <u>cache miss?</u>

- If we look something up in the cache and the entry is not there, we generally want to get the data from memory and put it in the cache
  - B/c the principle of locality says we'll probably use it again
- <u>Direct mapped</u> caches have 1 choice of what block to replace
- <u>Fully associative</u> or <u>set associative</u> caches offer more choices
- Usually 2 strategies:
  - Random: pick any possible block and replace it
  - LRU, which stands for "Least Recently Used"
    - Why not throw out the block that has gone unused for the longest time?
    - Usually approximated, and not much better than random, e.g. miss rates of 5.18% (LRU) vs. 5.69% (random) for a 16KB 2-way set associative cache
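The two replacement strategies can be sketched with a small simulator. This is a minimal model under stated assumptions: the class name `SetAssociativeCache`, its parameters, and the access trace are all hypothetical, and it implements exact LRU rather than the hardware approximations mentioned above.

```python
import random
from collections import OrderedDict


class SetAssociativeCache:
    """Toy cache that tracks block addresses only (no data)."""

    def __init__(self, num_sets, ways, policy="lru", seed=0):
        self.num_sets = num_sets
        self.ways = ways
        self.policy = policy                      # "lru" or "random"
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.rng = random.Random(seed)
        self.misses = 0

    def access(self, block_addr):
        """Access one block; return True on a hit, False on a miss."""
        blocks = self.sets[block_addr % self.num_sets]
        if block_addr in blocks:
            blocks.move_to_end(block_addr)        # now the most recently used
            return True
        self.misses += 1
        if len(blocks) == self.ways:              # set is full: pick a victim
            if self.policy == "lru":
                blocks.popitem(last=False)        # oldest entry = least recently used
            else:
                del blocks[self.rng.choice(list(blocks))]
        blocks[block_addr] = None
        return False


cache = SetAssociativeCache(num_sets=1, ways=2, policy="lru")
for block in [0, 2, 0, 4, 0, 2]:
    cache.access(block)
print(cache.misses)  # 4 misses with LRU on this trace
```

Switching `policy` to `"random"` on a longer trace gives a feel for how small the gap between the two strategies can be.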













### Write Policies: Analysis

- Write through
  - Simple
  - Correctness easily maintained and no ambiguity about which copy of a block is current
  - Drawback is the bandwidth required; memory access time
- Must also decide whether to fetch and allocate space for the block to be written
  - Write allocate: fetch such a block and put it in the cache
  - Write-no-allocate: avoid the fetch, and install blocks only on read misses
    - Good for cases of streaming writes which overwrite data

### Modeling Cache Performance

- CPU time equation... again!
- CPU execution time = (CPU clk cycles + Memory stall cycles) × clk cycle time
- Memory stall cycles = number of misses × miss penalty = IC × (memory accesses/instruction) × miss rate × miss penalty
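To make the write allocate vs. write-no-allocate decision concrete, the sketch below counts misses for a short sequence of reads and writes. It is only an illustration: the function `count_misses`, the addresses, and the unbounded cache (so no block is ever replaced) are assumptions chosen for the example.

```python
def count_misses(ops, write_allocate):
    """Count misses for a list of ("read" | "write", address) operations.

    The cache is modeled as an unbounded set of addresses that starts empty,
    so only the allocation policy affects the result.
    """
    cached = set()
    misses = 0
    for kind, addr in ops:
        if addr in cached:
            continue                      # hit
        misses += 1
        # Reads always install the block; writes only under write allocate.
        if kind == "read" or write_allocate:
            cached.add(addr)
    return misses


ops = [("write", 100), ("write", 100), ("read", 200), ("write", 200), ("write", 100)]
print(count_misses(ops, write_allocate=True))   # 2 misses
print(count_misses(ops, write_allocate=False))  # 4 misses
```

On this sequence write allocate wins because address 100 is written repeatedly; a streaming write that touches each address once would see no such benefit from the extra fetches.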





### Memory stall cycles

- Memory stall cycles: number of cycles that the processor is stalled waiting for a memory access
- Performance in terms of memory stall cycles
  - CPU time = (CPU cycles + Mem stall cycles) × Clk cycle time
  - Mem stall cycles = number of misses × miss penalty
    = IC × (Misses/Inst) × Miss penalty
    = IC × (Mem accesses/Inst) × Miss rate × Miss penalty
  - Note: Read and write misses are combined into one miss rate
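The equations above translate directly into a few lines of code. This is a sketch with made-up numbers: the instruction count, base CPI, miss rate, miss penalty, and clock period are hypothetical, and writing CPU cycles as IC × CPI is an assumption beyond what the slide states.

```python
def memory_stall_cycles(ic, mem_accesses_per_inst, miss_rate, miss_penalty):
    """Mem stall cycles = IC * (Mem accesses/Inst) * Miss rate * Miss penalty."""
    return ic * mem_accesses_per_inst * miss_rate * miss_penalty


def cpu_time(ic, base_cpi, mem_accesses_per_inst, miss_rate, miss_penalty, clk_cycle_time):
    """CPU time = (CPU cycles + Mem stall cycles) * Clk cycle time."""
    cpu_cycles = ic * base_cpi            # assumption: CPU cycles = IC * base CPI
    stalls = memory_stall_cycles(ic, mem_accesses_per_inst, miss_rate, miss_penalty)
    return (cpu_cycles + stalls) * clk_cycle_time


# Hypothetical machine: 1M instructions, CPI 1.0, 1.5 memory accesses per
# instruction, 2% miss rate, 100-cycle miss penalty, 1 ns clock.
stalls = memory_stall_cycles(1_000_000, 1.5, 0.02, 100)
total = cpu_time(1_000_000, 1.0, 1.5, 0.02, 100, 1e-9)
print(f"stall cycles: {stalls:.0f}")      # about 3,000,000
print(f"CPU time: {total * 1e3:.1f} ms")  # about 4.0 ms
```

With these numbers the processor spends three cycles stalled for every cycle of useful work, which is why the miss rate and miss penalty matter so much.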

















### <u>Next: How to Improve Cache Performance?</u>

AMAT = HitTime + MissRate × MissPenalty

1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
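A quick numeric sketch of the AMAT formula shows how each of the three levers moves the result. The baseline values (1-cycle hit, 5% miss rate, 100-cycle miss penalty) are hypothetical and chosen only to keep the arithmetic easy.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = HitTime + MissRate * MissPenalty."""
    return hit_time + miss_rate * miss_penalty


baseline = amat(hit_time=1, miss_rate=0.05, miss_penalty=100)       # 6.0 cycles
lower_miss_rate = amat(hit_time=1, miss_rate=0.025, miss_penalty=100)
lower_penalty = amat(hit_time=1, miss_rate=0.05, miss_penalty=50)
lower_hit_time = amat(hit_time=0.5, miss_rate=0.05, miss_penalty=100)

for name, value in [("baseline", baseline),
                    ("half the miss rate", lower_miss_rate),
                    ("half the miss penalty", lower_penalty),
                    ("half the hit time", lower_hit_time)]:
    print(f"{name}: {value:.2f} cycles")
```

In this made-up configuration the miss term dominates, so halving the miss rate or the miss penalty helps far more than halving the hit time; with a much lower miss rate the balance would shift.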

- Appendix C: Basic Cache Concepts
- Chapter 5: Cache Optimizations

Project 2: Study performance of benchmarks (project 1 benchmarks) using different cache organizations
































## **Improving Cache Performance**

1. <u>Reduce the miss rate</u>,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.













### <u>Larger cache block size (example)</u>

- Assume that to access the lower level of the memory hierarchy you:
  - Incur a 40 clock cycle overhead
  - Get 16 bytes of data every 2 clock cycles
  - I.e., get 16 bytes in 42 clock cycles, 32 in 44, etc.
- Using the data below, which block size has the minimum average memory access time?

Miss rates, by block size (rows) and cache size (columns):

| Block size (bytes) | 1K     | 4K    | 16K   | 64K   | 256K  |
|--------------------|--------|-------|-------|-------|-------|
| 16                 | 15.05% | 8.57% | 3.94% | 2.04% | 1.09% |
| 32                 | 13.34% | 7.24% | 2.87% | 1.35% | 0.70% |
| 64                 | 13.76% | 7.00% | 2.64% | 1.06% | 0.51% |
| 128                | 16.64% | 7.78% | 2.77% | 1.02% | 0.49% |



### <u>Larger cache block size (ex. continued)</u>

Average memory access time, in clock cycles, by block size (rows) and cache size (columns):

| Block size (bytes) | Miss penalty | 1K        | 4K        | 16K       | 64K       | 256K      |
|--------------------|--------------|-----------|-----------|-----------|-----------|-----------|
| 16                 | 42           | 7.321     | 4.599     | 2.655     | 1.857     | 1.458     |
| 32                 | 44           | **6.870** | **4.186** | **2.263** | 1.594     | 1.308     |
| 64                 | 48           | 7.605     | 4.360     | 2.267     | **1.509** | **1.245** |
| 128                | 56           | 10.318    | 5.357     | 2.551     | 1.571     | 1.274     |
| 256                | 72           | 16.847    | 7.847     | 3.369     | 1.828     | 1.353     |

- Bold entries are the lowest average time for a particular cache size
- Note: All of these block sizes are common in today's processors
- Note: Table entries are in units of clock cycles
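The average access times can be recomputed from the miss rates and the miss penalty rule (40 cycles of overhead plus 2 cycles per 16 bytes). The sketch below does this for the block sizes whose miss rates are listed (16 to 128 bytes). The 1-cycle hit time is an assumption, inferred from the tabulated results (for example 7.321 = 1 + 0.1505 × 42); the variable and function names are made up for the example.

```python
CACHE_SIZES = ["1K", "4K", "16K", "64K", "256K"]

# Miss rates in percent, one list per block size (bytes), ordered as CACHE_SIZES.
MISS_RATES = {
    16:  [15.05, 8.57, 3.94, 2.04, 1.09],
    32:  [13.34, 7.24, 2.87, 1.35, 0.70],
    64:  [13.76, 7.00, 2.64, 1.06, 0.51],
    128: [16.64, 7.78, 2.77, 1.02, 0.49],
}

HIT_TIME = 1  # assumption: 1 clock cycle, consistent with the tabulated times


def miss_penalty(block_size):
    """40 cycles of overhead plus 2 cycles for every 16 bytes transferred."""
    return 40 + 2 * (block_size // 16)


def amat(block_size, miss_rate_pct):
    """Average memory access time in clock cycles."""
    return HIT_TIME + miss_rate_pct / 100 * miss_penalty(block_size)


best = {}
for col, size in enumerate(CACHE_SIZES):
    best[size] = min(MISS_RATES, key=lambda b: amat(b, MISS_RATES[b][col]))
    print(size, "-> best block size:", best[size], "bytes")
```

The result matches the pattern in the table: 32-byte blocks win for the small caches, and 64-byte blocks win once the cache is large enough that the lower miss rate outweighs the longer transfer.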



















### Victim caches

- 1<sup>st</sup> of all, what is a "victim cache"?
  - A victim cache (usually not that big) temporarily stores blocks that have been discarded from the main cache due to conflict misses
- 2<sup>nd</sup> of all, how does it help us?
  - If there's a cache miss, instead of immediately going down to the next level of the memory hierarchy, we check the victim cache first
  - If the entry is there, we swap the victim cache block with the actual cache block
- Research shows:
  - Victim caches with 1-5 entries help reduce conflict misses
  - For a 4KB direct mapped cache, victim caches removed 20%-95% of conflict misses!
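The check-then-swap behavior can be sketched as a direct mapped cache backed by a small fully associative buffer. This is a toy model under stated assumptions: the class name `DirectMappedWithVictim`, the FIFO handling of the victim buffer, and the ping-pong trace are all hypothetical.

```python
from collections import deque


class DirectMappedWithVictim:
    """Direct mapped cache with a small victim cache (block addresses only)."""

    def __init__(self, num_blocks, victim_entries):
        self.num_blocks = num_blocks
        self.lines = [None] * num_blocks
        # Bounded deque: when full, the oldest victim is dropped (FIFO assumption).
        self.victims = deque(maxlen=victim_entries)
        self.main_hits = self.victim_hits = self.misses = 0

    def access(self, block_addr):
        idx = block_addr % self.num_blocks
        resident = self.lines[idx]
        if resident == block_addr:
            self.main_hits += 1
            return "hit"
        if block_addr in self.victims:
            # Found in the victim cache: swap it with the resident block.
            self.victims.remove(block_addr)
            self.victim_hits += 1
            outcome = "victim hit"
        else:
            # Missed in both: the block comes from the next level down.
            self.misses += 1
            outcome = "miss"
        if resident is not None:
            self.victims.append(resident)     # discarded block becomes a victim
        self.lines[idx] = block_addr
        return outcome


# Blocks 0 and 8 conflict in an 8-block direct mapped cache.
trace = [0, 8, 0, 8, 0, 8]
plain = DirectMappedWithVictim(num_blocks=8, victim_entries=0)
with_victim = DirectMappedWithVictim(num_blocks=8, victim_entries=1)
for block in trace:
    plain.access(block)
    with_victim.access(block)
print(plain.misses, with_victim.misses)  # 6 misses vs. 2 misses
```

Even a single-entry victim cache removes every conflict miss in this ping-pong pattern; only the two compulsory misses remain.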





















### Compiler-controlled prefetching

- It's also possible for the compiler to tell the hardware that it should prefetch instructions or data
  - It (the compiler) could have values loaded into registers, called register prefetching
  - Or, the compiler could just have data loaded into the cache, called cache prefetching
- Getting things from lower levels of memory can cause faults if the data is not there...
  - Ideally, we want prefetching to be "invisible" to the program, so nonbinding/nonfaulting prefetching is often used
    - With a nonfaulting scheme, faulting instructions are turned into no-ops
    - With a "faulting" scheme, data would be fetched (as "normal")

### Reducing Misses by Compiler Optimizations

- McFarling [1989] reduced cache misses by 75% on an 8KB direct mapped cache with 4 byte blocks, in software
- Instructions
  - Reorder procedures in memory so as to reduce conflict misses
  - Profiling to look at conflicts (using tools they developed)
- Data
  - Merging Arrays: improve spatial locality by using a single array of compound elements vs. 2 arrays
  - Loop Interchange: change the nesting of loops to access data in the order it is stored in memory
  - Loop Fusion: combine 2 independent loops that have the same looping and some overlapping variables
  - Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
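Loop interchange can be illustrated by counting misses for the two loop orders in a simple cache model. The sketch below is an assumption-laden toy: a direct mapped cache with 64 blocks of 8 words, a 128 × 128 array stored in row-major order, and the helper name `count_misses` are all chosen for the example.

```python
def count_misses(addresses, num_blocks=64, words_per_block=8):
    """Count misses of a direct mapped cache for a sequence of word addresses."""
    frames = [None] * num_blocks
    misses = 0
    for addr in addresses:
        block = addr // words_per_block
        idx = block % num_blocks
        if frames[idx] != block:
            frames[idx] = block
            misses += 1
    return misses


N = 128  # row-major layout: element (i, j) lives at word address i * N + j

# Before interchange: the inner loop walks down a column (stride of N words).
column_order = [i * N + j for j in range(N) for i in range(N)]
# After interchange: the inner loop walks along a row (stride of 1 word).
row_order = [i * N + j for i in range(N) for j in range(N)]

print(count_misses(column_order))  # 16384: every access misses
print(count_misses(row_order))     # 2048: one miss per 8-word block
```

Both orders touch exactly the same elements; only the order changes, so the entire difference comes from spatial locality.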


























- This problem centers around virtual addresses. Should we send the virtual address to the cache?
  - In other words, we have virtual caches vs. physical caches
  - Why is this a problem anyhow?
    - Well, recall from OS that a processor usually deals with processes
    - What if process 1 uses a virtual address xyz and process 2 uses the same virtual address?
    - The data in the cache would be totally different! This is called aliasing
- Every time a process is switched logically, we'd have to flush the cache or we'd get false hits.
  - Cost = time to flush + compulsory misses from the empty cache
- I/O must interact with caches, so we need virtual addresses
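The false-hit problem and the flush fix can be shown with a toy virtually addressed cache. Everything in this sketch is hypothetical: the class `VirtualCache`, the page tables, the physical addresses, and the stored strings are invented for illustration.

```python
class VirtualCache:
    """Toy cache indexed by virtual address only (no process information)."""

    def __init__(self):
        self.lines = {}

    def read(self, vaddr, page_table, memory):
        if vaddr not in self.lines:                  # miss: translate and fetch
            self.lines[vaddr] = memory[page_table[vaddr]]
        return self.lines[vaddr]                     # hit: no translation at all

    def flush(self):
        self.lines.clear()


# Two processes use the same virtual address for different physical data.
memory = {0xA000: "data of process 1", 0xB000: "data of process 2"}
page_table_p1 = {0x1000: 0xA000}
page_table_p2 = {0x1000: 0xB000}

cache = VirtualCache()
cache.read(0x1000, page_table_p1, memory)               # process 1 runs
false_hit = cache.read(0x1000, page_table_p2, memory)   # switch without a flush
print(false_hit)   # "data of process 1": a false hit

cache.flush()                                           # flush on the switch
correct = cache.read(0x1000, page_table_p2, memory)
print(correct)     # "data of process 2", at the cost of a compulsory miss
```

The flush restores correctness, but the first access after every switch now misses, which is the cost listed above.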











### <u>Cache Summary</u>

- Cache performance is crucial to overall performance
- Optimize performance
  - Miss rates
  - Miss penalty
  - Hit time
- Software optimizations can lead to improved performance
- Next: Code Optimization in Compilers