CPU Cache Hierarchy and Prefetching
Video: Memory, Cache Locality, and why Arrays are Fast (Data Structures and Optimization)
Modern CPUs can execute billions of instructions per second, but there's a massive speed gap between the CPU and main memory (RAM). While the CPU can execute an instruction in 1-3 cycles, accessing main memory takes 100-300 cycles. This is known as the memory wall: the performance gap between CPU speed and memory access speed that leaves the CPU constantly waiting for data from memory.
CPUs employ caches to solve this problem: a hierarchy of progressively larger but slower memory levels that keep frequently accessed data close to the CPU.
The Cache Hierarchy
L1 Cache (Level 1)
- Size: 32-64KB per core
- Latency: 1-2 cycles
- Location: On the CPU die, closest to the execution units
- Split Design: L1 is divided into two separate caches:
- L1 Instruction Cache (L1i): Stores program instructions
- L1 Data Cache (L1d): Stores data values
The L1 cache is the fastest but smallest cache level. It's designed for maximum speed, which is why it's split into instruction and data caches - this allows the CPU to fetch instructions and data simultaneously.
L2 Cache (Level 2)
- Size: 256KB-1MB per core
- Latency: 10-20 cycles
- Location: On the CPU die, but further from execution units
- Unified: Single cache for both instructions and data
L2 cache serves as a middle ground - larger than L1 but slower. It catches most of the misses from L1 cache, reducing the need to go to the much slower L3 cache or main memory.
L3 Cache (Level 3)
- Size: 8-32MB total
- Latency: 40-80 cycles
- Location: On the CPU die, shared among all cores
- Shared: All CPU cores share the same L3 cache
L3 cache is the largest on-chip cache and is shared among all CPU cores. This helps reduce memory bandwidth pressure and allows cores to share data efficiently.
Cache Lines and Spatial Locality
When the CPU requests a single byte from memory, the cache doesn't just load that one byte. Instead, it loads an entire cache line: a fixed-size block of memory (typically 64 bytes), transferred as a unit between cache and main memory, that contains the requested byte. This works well because programs tend to access data that is physically close in memory. This tendency is called spatial locality.
For example:
- Programs usually access instructions sequentially, so it's useful to load the next instructions into the cache before they are needed.
- When you access `array[0]`, the cache loads the entire cache line containing `array[0]` through `array[15]` (assuming 4-byte integers and a 64-byte line). This means accessing `array[1]`, `array[2]`, etc. will be cache hits.
Temporal Locality and Cache Replacement
Programs also have a tendency to access the same data repeatedly over time. This is called temporal locality.
When the cache is full and new data needs to be loaded, the cache must evict some existing data. Most caches use an LRU (Least Recently Used) policy: the data that hasn't been accessed for the longest time is evicted first.
Hardware Prefetching
Modern CPUs don't just wait for explicit memory requests. They use hardware prefetching to predict which data will be needed next and load it into cache before it's actually requested.
Types of Prefetching
- Sequential Prefetching: When the CPU detects sequential memory access patterns (like iterating through an array), it automatically loads the next cache lines.
- Stride Prefetching: For regular access patterns with a constant stride (like accessing every 4th element), the CPU predicts future addresses.
- Instruction Prefetching: The CPU prefetches instructions that are likely to be executed next, based on branch prediction: a technique that predicts which path of a branch (such as an if-else) will be taken.
Cache Misses and Performance Impact
A cache miss occurs when the CPU requests data that isn't in the cache. This forces the CPU to:
- Check the next cache level (L2, L3)
- If not found, access main memory
- Load the data into cache
- Resume execution
Cache misses are expensive because they cause the CPU to stall while waiting for data. The performance impact depends on where the data is found:
- L2 hit: ~10-20 cycle penalty
- L3 hit: ~40-80 cycle penalty
- Main memory: ~100-300 cycle penalty
Write Policies
Caches use different policies for handling writes:
Write-Through
- Writes are immediately sent to main memory
- Ensures data consistency but reduces performance
- Used in some simpler cache designs; instruction caches largely sidestep the issue, since instructions are rarely modified
Write-Back
- Writes are initially stored only in cache
- Data is written to main memory later when the cache line is evicted
- Improves performance by reducing memory bandwidth
- Used in L1 and L2 data caches
- Makes multi-core systems more complex, since data consistency becomes harder to maintain
Cache Coherence
In multi-core systems, multiple cores may have copies of the same data in their caches. Cache coherence protocols ensure that all cores see a consistent view of memory by:
- Invalidating stale cache lines when data is modified
- Broadcasting updates to all cores
- Using protocols like MESI (Modified, Exclusive, Shared, Invalid)
C++ offers a few ways to influence cache behavior; we'll cover them in the low-latency C++ track.
Performance Optimization
Understanding cache behavior is crucial for writing high-performance code:
Optimize for Spatial Locality
```cpp
// Good: sequential access - each cache line is fully used
for (int i = 0; i < N; i++) {
    sum += array[i];
}

// Bad: random access - most loads touch a new cache line
for (int i = 0; i < N; i++) {
    sum += array[random_indices[i]];
}
```
Optimize for Temporal Locality
```cpp
// Good: reuse data in cache - row-major traversal walks memory sequentially
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        result[i] += matrix[i][j] * vector[j];
    }
}

// Bad: poor cache reuse - the inner loop strides a full row's worth of
// bytes per step, touching a new cache line almost every iteration
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        result[i] += matrix[i][j] * vector[j];
    }
}
```
Cache-Friendly Data Structures
- Use arrays instead of linked lists when possible
- Structure data to minimize cache misses
- Consider cache line size when designing data layouts
Real-World Impact
Cache performance is critical in high-frequency trading and systems programming:
- Latency-sensitive applications: Every cache miss adds tens to hundreds of nanoseconds, which quickly accumulates on hot paths
- Throughput optimization: Better cache utilization means higher throughput
- Memory bandwidth: Efficient cache usage reduces memory bandwidth pressure
- Power efficiency: Cache hits use less power than memory accesses
Understanding cache behavior helps you write code that works with the hardware rather than against it, leading to significant performance improvements in critical applications.
Questions
Q: Which cache level is typically the fastest but smallest?
L1 cache is the fastest and smallest cache level, typically 32-64KB per core with 1-2 cycle access latency. It's split into L1 instruction cache (L1i) and L1 data cache (L1d) to optimize for different access patterns.
Q: What is the main purpose of the L1 instruction cache?
The L1 instruction cache (L1i) stores recently fetched program instructions to avoid repeated memory accesses. It's optimized for sequential instruction access patterns and helps maintain high instruction throughput.
Q: Which cache level is typically shared among all CPU cores?
L3 cache is typically shared among all CPU cores, providing a large shared cache (8-32MB) with higher latency (40-80 cycles) but much larger capacity than L1/L2 caches. This helps reduce memory bandwidth pressure.
Q: What is hardware prefetching?
Hardware prefetching is when the CPU automatically predicts which data will be needed next and loads it into cache before it's actually requested. This reduces cache misses and improves performance for predictable access patterns.
Q: What is a cache line?
A cache line is a fixed-size block of memory (typically 64 bytes) that is transferred as a unit between cache and main memory. When the CPU requests a single byte, the entire cache line containing that byte is loaded into cache.
Q: What happens during a cache miss?
During a cache miss, the CPU must fetch the requested data from a lower cache level (L2, L3) or main memory. This causes a stall while waiting for the data, which is why cache misses significantly impact performance.
Q: What is spatial locality?
Spatial locality refers to the tendency to access data that is physically close in memory. This is why cache lines work well - when you access one element of an array, nearby elements are likely to be accessed soon.
Q: What is temporal locality?
Temporal locality refers to the tendency to access the same data repeatedly over time. This is why recently accessed data stays in cache - it's likely to be accessed again soon.
Q: What is the typical size of L2 cache per core?
L2 cache is typically 256KB to 1MB per core with 10-20 cycle access latency. It's larger than L1 but slower, serving as a middle ground between the fast L1 cache and the larger but slower L3 cache.
Q: What is a write-back cache policy?
In a write-back policy, writes are initially stored only in cache. The data is written to main memory later when the cache line is evicted or explicitly flushed. This reduces memory bandwidth and improves performance.