Appearance
Instruction Reordering and Out-of-Order Execution
Not every CPU instruction takes the same amount of cycles. Floating point operations like mul and div are very expensive as opposed to operations like lshift or rshift. Further more, some operations may require accessing memory and stall subsequent instructions. Instruction reordering solves this exact dilemma - what if we could process independent instructions while waiting for a more expensive operation to complete?
In a simple in-order processor, instructions are executed exactly in the order they appear in the program. This creates a significant performance bottleneck because:
- Long-latency operations (like memory accesses) block subsequent instructions
- Independent instructions cannot execute in parallel
- CPU resources remain idle while waiting for slow operations
For example, consider this code:
cpp
int a = load_from_memory(); // 100 cycles
int b = a + 1; // 1 cycle (waits for a)
int c = 5 * 2; // 1 cycle (independent, but waits)
int d = b + c; // 1 cycle (waits for b and c)In an in-order processor, this takes 103 cycles because each instruction waits for the previous one.
The Solution: Out-of-Order Execution
Out-of-order execution allows the CPU to execute instructions in a different order than they appear in the program, as long as dependencies are respected. This enables:
- Parallel execution of independent instructions
- Better resource utilization by keeping execution units busy
- Hiding latency of slow operations
The same code in an out-of-order processor might execute like this:
cpp
int a = load_from_memory(); // 100 cycles (starts first)
int c = 5 * 2; // 1 cycle (executes in parallel with load)
int b = a + 1; // 1 cycle (executes when a is ready)
int d = b + c; // 1 cycle (executes when b and c are ready)Total time: ~101 cycles instead of 103 cycles.
How Out-of-Order Execution Works
The Tomasulo Algorithm
Modern out-of-order processors use variations of the Tomasulo algorithm, which consists of several key components:
1. Register Renaming
Register renaming eliminates false dependencies by dynamically mapping architectural registers (visible to the program) to physical registers (actual hardware registers).
cpp
// Original code with false dependency
r1 = load(x); // r1 = architectural register
r1 = r1 + 1; // This appears to depend on the load
r2 = r1 * 2; // This depends on the add
// After register renaming
p1 = load(x); // p1 = physical register
p2 = p1 + 1; // p2 = different physical register
p3 = p2 * 2; // p3 = different physical register2. Reservation Stations
Reservation stations hold instructions that are ready to execute (all operands available) and dispatch them to execution units when the units become available.
cpp
Reservation Station 1: [ADD p2 = p1 + 1] (waiting for p1)
Reservation Station 2: [MUL p3 = p2 * 2] (waiting for p2)
Reservation Station 3: [LOAD p4 = y] (ready to execute)3. Reorder Buffer (ROB)
The reorder buffer ensures that instructions are retired (made visible to the program) in the same order they appeared in the original program, even though they may have executed out of order.
cpp
ROB Entry 1: [LOAD p1 = x] (completed, ready to retire)
ROB Entry 2: [ADD p2 = p1 + 1] (completed, waiting for entry 1)
ROB Entry 3: [MUL p3 = p2 * 2] (completed, waiting for entry 2)Execution Flow
- Fetch: Instructions are fetched from memory
- Decode: Instructions are decoded and dependencies identified
- Rename: Architectural registers are renamed to physical registers
- Dispatch: Instructions are sent to reservation stations
- Execute: Instructions execute when operands are available
- Writeback: Results are written to physical registers
- Retire: Instructions are retired in program order
Data Dependencies
Types of Dependencies
1. Data Dependencies (True Dependencies)
When one instruction depends on the result of another instruction:
cpp
int a = load(x); // Instruction A
int b = a + 1; // Instruction B depends on A2. Anti-Dependencies
When one instruction writes to a register that a later instruction reads:
cpp
int a = 5; // Instruction A writes to a
int b = a + 1; // Instruction B reads from a
int a = 10; // Instruction C writes to a3. Output Dependencies
When multiple instructions write to the same register:
cpp
int a = 5; // Instruction A writes to a
int a = 10; // Instruction B writes to aDependency Resolution
Register renaming eliminates anti-dependencies and output dependencies by using different physical registers:
cpp
// Original code with dependencies
int a = 5; // p1 = 5
int b = a + 1; // p2 = p1 + 1
int a = 10; // p3 = 10 (different physical register)
// After renaming, no dependencies between the add and the second assignmentMemory Ordering
The Memory Ordering Problem
In multi-threaded programs, instruction reordering can cause memory ordering problems:
cpp
// Thread 1
x = 1; // Store to x
ready = 1; // Store to ready
// Thread 2
while (!ready); // Load from ready
print(x); // Load from xWithout proper memory ordering, Thread 2 might see ready = 1 but x = 0, which violates the expected ordering.
Memory Barriers
Memory barriers are instructions that prevent certain types of instruction reordering:
1. Full Memory Barrier
Prevents all reordering across the barrier:
cpp
x = 1;
memory_barrier(); // All stores before barrier complete before loads after
y = load(z);2. Load-Load Barrier
Prevents reordering of loads:
cpp
int a = load(x);
load_load_barrier();
int b = load(y); // Cannot be reordered before load(x)3. Store-Store Barrier
Prevents reordering of stores:
cpp
x = 1;
store_store_barrier();
y = 2; // Cannot be reordered before x = 14. Load-Store Barrier
Prevents reordering of loads and stores:
cpp
int a = load(x);
load_store_barrier();
y = 2; // Cannot be reordered before load(x)Acquire Semantics
Prevents reordering of loads before the acquire operation:
cpp
int a = load(x);
acquire_barrier();
int b = load(y); // Cannot be reordered before load(x)Release Semantics
Prevents reordering of stores after the release operation:
cpp
x = 1;
release_barrier();
y = 2; // Cannot be reordered after x = 1Speculative Execution
What is Speculative Execution?
Speculative execution is executing instructions that may not be needed, hoping they will be useful. If the speculation is wrong, the results are discarded.
Branch Prediction and Speculation
cpp
if (condition) {
x = expensive_calculation();
} else {
x = simple_calculation();
}The CPU might:
- Predict the branch outcome
- Speculatively execute the predicted path
- Discard results if prediction was wrong
Benefits
- Hides branch latency: Continues execution while waiting for branch resolution
- Better resource utilization: Keeps execution units busy
- Improved performance: Reduces pipeline stalls
Risks
- Wasted work: Speculative execution may be discarded
- Security vulnerabilities: Spectre and Meltdown attacks
- Power consumption: Executing unnecessary instructions
Performance Implications
When Out-of-Order Helps
Out-of-order execution is most beneficial when:
- Independent instructions are available
- Long-latency operations (memory accesses, divisions) are present
- Multiple execution units are available
- Register pressure is low
When Out-of-Order Doesn't Help
Out-of-order execution provides little benefit when:
- All instructions are dependent on each other
- Memory bandwidth is the bottleneck
- Cache misses are frequent
- Branch mispredictions are common
Real-World Impact
In high-frequency trading and systems programming:
- Latency sensitivity: Every cycle matters
- Predictable performance: Understanding reordering helps optimize
- Memory ordering: Critical for correct multi-threaded code
Hardware Implementation
Modern CPU Architecture
cpp
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Fetch Unit │───▶│ Decode/Rename │───▶│ Reservation │
│ │ │ │ │ Stations │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Reorder │◀───│ Execution │◀───│ Physical │
│ Buffer │ │ Units │ │ Registers │
└─────────────────┘ └─────────────────┘ └─────────────────┘1. Fetch Unit
- Fetches instructions from instruction cache
- Handles branch prediction
- Maintains instruction pointer
2. Decode/Rename Unit
- Decodes instructions
- Identifies dependencies
- Performs register renaming
3. Reservation Stations
- Hold ready instructions
- Dispatch to execution units
- Track operand availability
4. Execution Units
- ALU (arithmetic/logic operations)
- FPU (floating-point operations)
- LSU (load/store unit)
- Branch unit
5. Reorder Buffer
- Maintains program order
- Handles instruction retirement
- Manages exceptions
Programming Considerations
Writing Out-of-Order Friendly Code
1. Minimize Dependencies
cpp
// Bad: Many dependencies
int result = 0;
for (int i = 0; i < N; i++) {
result += array[i]; // Each iteration depends on previous
}
// Good: Fewer dependencies
int sum1 = 0, sum2 = 0;
for (int i = 0; i < N; i += 2) {
sum1 += array[i];
sum2 += array[i + 1]; // Independent operations
}
int result = sum1 + sum2;2. Use Memory Ordering Correctly
cpp
// Correct use of memory barriers
std::atomic<int> flag{0};
int data = 0;
// Thread 1
data = 42;
flag.store(1, std::memory_order_release);
// Thread 2
while (flag.load(std::memory_order_acquire) == 0);
assert(data == 42); // Guaranteed to be trueDebugging Out-of-Order Issues
1. Memory Ordering Bugs
- Symptoms: Inconsistent behavior in multi-threaded code
- Tools: ThreadSanitizer, Valgrind Helgrind
- Solution: Use proper memory barriers and atomic operations
2. Performance Issues
- Symptoms: Poor performance despite good algorithms
- Tools: Performance profilers, hardware counters
- Solution: Analyze instruction-level parallelism and dependencies
Questions
Q: What is out-of-order execution?
Out-of-order execution is a technique where the CPU executes instructions in a different order than they appear in the program to improve performance. The CPU can execute independent instructions in parallel while waiting for slower operations to complete.
Q: Why do modern CPUs use out-of-order execution?
Modern CPUs use out-of-order execution to make programs run faster by executing independent instructions in parallel. This allows the CPU to continue working while waiting for slow operations like memory accesses to complete.
Q: What is a data dependency?
A data dependency occurs when one instruction depends on the result of another instruction. For example, if instruction B needs the result of instruction A, then B cannot execute until A completes.
Q: What is the reorder buffer (ROB)?
The reorder buffer (ROB) stores completed instructions in program order for retirement. It ensures that instructions are retired (made visible to the program) in the same order they appeared in the original program, even though they may have executed out of order.
Q: What is instruction retirement?
Instruction retirement is when an instruction's results are made visible to the program in program order. Even though instructions may execute out of order, they must retire in program order to maintain the illusion of sequential execution.
Q: What is a memory barrier?
A memory barrier is an instruction that prevents certain types of instruction reordering. It ensures that memory operations before the barrier complete before memory operations after the barrier begin.
Q: What is the difference between acquire and release semantics?
Acquire semantics prevent reordering of loads before the acquire operation, while release semantics prevent reordering of stores after the release operation. This is important for implementing synchronization primitives like mutexes.
Q: What is speculative execution?
Speculative execution is executing instructions that may not be needed, hoping they will be useful. If the speculation is wrong, the results are discarded. This is commonly used with branch prediction.
Q: What is the purpose of the reservation station?
The reservation station stores instructions waiting to be executed. It holds instructions that are ready to execute (all their operands are available) and dispatches them to execution units when the units become available.
Q: What is register renaming?
Register renaming is dynamically mapping architectural registers (those visible to the program) to physical registers (actual hardware registers) to eliminate false dependencies. This allows more instructions to execute in parallel.