CPU Cache Hierarchy and Prefetching
Video: Memory, Cache Locality, and why Arrays are Fast (Data Structures and Optimization)
Modern CPUs can execute billions of instructions per second, but there's a massive speed gap between the CPU and main memory (RAM). While the CPU can execute an instruction in 1-3 cycles, accessing main memory takes 100-300 cycles. This is known as the memory wall: the performance gap between CPU speed and memory access speed that leaves the CPU constantly waiting for data from memory.
CPUs employ caches to solve this problem: a hierarchy of progressively larger but slower memory levels that keep frequently accessed data close to the CPU.
The Cache Hierarchy
L1 Cache (Level 1)
- Size: 32-64KB per core
- Latency: 1-2 cycles
- Location: On the CPU die, closest to the execution units
- Split Design: L1 is divided into two separate caches:
- L1 Instruction Cache (L1i): Stores program instructions
- L1 Data Cache (L1d): Stores data values
The L1 cache is the fastest but smallest cache level. It's designed for maximum speed, which is why it's split into instruction and data caches - this allows the CPU to fetch instructions and data simultaneously.
L2 Cache (Level 2)
- Size: 256KB-1MB per core
- Latency: 10-20 cycles
- Location: On the CPU die, but further from execution units
- Unified: Single cache for both instructions and data
L2 cache serves as a middle ground - larger than L1 but slower. It catches most of the misses from L1 cache, reducing the need to go to the much slower L3 cache or main memory.
L3 Cache (Level 3)
- Size: 8-32MB total
- Latency: 40-80 cycles
- Location: On the CPU die, shared among all cores
- Shared: All CPU cores share the same L3 cache
L3 cache is the largest on-chip cache and is shared among all CPU cores. This helps reduce memory bandwidth pressure and allows cores to share data efficiently.
Cache Lines and Spatial Locality
When the CPU requests a single byte from memory, the cache doesn't just load that one byte. Instead, it loads an entire cache line: a fixed-size block of memory (typically 64 bytes), transferred as a unit between cache and main memory, that contains the requested byte. This works well because programs tend to access data that is physically close in memory. This tendency is called spatial locality.
For example:
- Programs usually access instructions sequentially, so it's useful to load the next instructions into the cache before they are needed.
- When you access `array[0]`, the cache loads the entire cache line containing `array[0]` through `array[15]` (assuming 4-byte integers and a 64-byte line). This means accessing `array[1]`, `array[2]`, etc. will be cache hits.
Temporal Locality and Cache Replacement
Programs also have a tendency to access the same data repeatedly over time. This is called temporal locality.
When the cache is full and new data needs to be loaded, the cache must evict some existing data. Most caches use an LRU (Least Recently Used) policy: the data that hasn't been accessed for the longest time is evicted first.
Hardware Prefetching
Modern CPUs don't just wait for explicit memory requests. They use hardware prefetching to predict which data will be needed next and load it into cache before it's actually requested.
Types of Prefetching
- Sequential Prefetching: When the CPU detects sequential memory access patterns (like iterating through an array), it automatically loads the next cache lines.
- Stride Prefetching: For regular access patterns with a constant stride (like accessing every 4th element), the CPU predicts future addresses.
- Instruction Prefetching: The CPU prefetches instructions that are likely to be executed next, based on branch prediction: a technique that predicts which path of a branch (such as an if-else) will be taken.
Cache Misses and Performance Impact
A cache miss occurs when the CPU requests data that isn't in the cache. This forces the CPU to:
- Check the next cache level (L2, L3)
- If not found, access main memory
- Load the data into cache
- Resume execution
Cache misses are expensive because they cause the CPU to stall while waiting for data. The performance impact depends on where the data is found:
- L2 hit: ~10-20 cycle penalty
- L3 hit: ~40-80 cycle penalty
- Main memory: ~100-300 cycle penalty
Write Policies
Caches use different policies for handling writes:
Write-Through
- Writes are immediately sent to main memory
- Ensures data consistency but reduces performance
- Used in some simpler cache designs; instruction caches largely sidestep the issue, since instructions are rarely modified
Write-Back
- Writes are initially stored only in cache
- Data is written to main memory later when the cache line is evicted
- Improves performance by reducing memory bandwidth
- Used in L1 and L2 data caches
- Makes multi-core systems more complex, since data consistency becomes harder to maintain
Cache Coherence
In multi-core systems, multiple cores may have copies of the same data in their caches. Cache coherence protocols ensure that all cores see a consistent view of memory by:
- Invalidating stale cache lines when data is modified
- Broadcasting updates to all cores
- Using protocols like MESI (Modified, Exclusive, Shared, Invalid)
C++ offers a few ways to influence cache behavior; we'll cover them in the low-latency C++ track.
Performance Optimization
Understanding cache behavior is crucial for writing high-performance code:
Optimize for Spatial Locality
```cpp
// Good: sequential access - each cache line is fully used
for (int i = 0; i < N; i++) {
    sum += array[i];
}

// Bad: random access - most loads touch a new cache line
for (int i = 0; i < N; i++) {
    sum += array[random_indices[i]];
}
```
Optimize for Temporal Locality
```cpp
// Good: reuse data in cache - row-major traversal walks memory sequentially
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        result[i] += matrix[i][j] * vector[j];
    }
}

// Bad: poor cache reuse - the inner loop strides a full row's worth of
// bytes per step, touching a new cache line almost every iteration
for (int j = 0; j < N; j++) {
    for (int i = 0; i < N; i++) {
        result[i] += matrix[i][j] * vector[j];
    }
}
```
Cache-Friendly Data Structures
- Use arrays instead of linked lists when possible
- Structure data to minimize cache misses
- Consider cache line size when designing data layouts
Real-World Impact
Cache performance is critical in high-frequency trading and systems programming:
- Latency-sensitive applications: Every cache miss adds tens to hundreds of nanoseconds, which quickly accumulates on hot paths
- Throughput optimization: Better cache utilization means higher throughput
- Memory bandwidth: Efficient cache usage reduces memory bandwidth pressure
- Power efficiency: Cache hits use less power than memory accesses
Understanding cache behavior helps you write code that works with the hardware rather than against it, leading to significant performance improvements in critical applications.
Questions
Q: Which cache level is typically the fastest but smallest?
L1 cache is the fastest and smallest cache level, typically 32-64KB per core with 1-2 cycle access latency. It's split into L1 instruction cache (L1i) and L1 data cache (L1d) to optimize for different access patterns.
Q: What is the main purpose of the L1 instruction cache?
The L1 instruction cache (L1i) stores recently fetched program instructions to avoid repeated memory accesses. It's optimized for sequential instruction access patterns and helps maintain high instruction throughput.
Q: Which cache level is typically shared among all CPU cores?
L3 cache is typically shared among all CPU cores, providing a large shared cache (8-32MB) with higher latency (40-80 cycles) but much larger capacity than L1/L2 caches. This helps reduce memory bandwidth pressure.
Q: What is hardware prefetching?
Hardware prefetching is when the CPU automatically predicts which data will be needed next and loads it into cache before it's actually requested. This reduces cache misses and improves performance for predictable access patterns.
Q: What is a cache line?
A cache line is a fixed-size block of memory (typically 64 bytes) that is transferred as a unit between cache and main memory. When the CPU requests a single byte, the entire cache line containing that byte is loaded into cache.
Q: What happens during a cache miss?
During a cache miss, the CPU must fetch the requested data from a lower cache level (L2, L3) or main memory. This causes a stall while waiting for the data, which is why cache misses significantly impact performance.
Q: What is spatial locality?
Spatial locality refers to the tendency to access data that is physically close in memory. This is why cache lines work well - when you access one element of an array, nearby elements are likely to be accessed soon.
Q: What is temporal locality?
Temporal locality refers to the tendency to access the same data repeatedly over time. This is why recently accessed data stays in cache - it's likely to be accessed again soon.
Q: What is the typical size of L2 cache per core?
L2 cache is typically 256KB to 1MB per core with 10-20 cycle access latency. It's larger than L1 but slower, serving as a middle ground between the fast L1 cache and the larger but slower L3 cache.
Q: What is a write-back cache policy?
In a write-back policy, writes are initially stored only in cache. The data is written to main memory later when the cache line is evicted or explicitly flushed. This reduces memory bandwidth and improves performance.