Instruction Reordering and Out-of-Order Execution

Not every CPU instruction takes the same amount of cycles. Floating point operations like mul and div are very expensive as opposed to operations like lshift or rshift. Further more, some operations may require accessing memory and stall subsequent instructions. Instruction reordering solves this exact dilemma - what if we could process independent instructions while waiting for a more expensive operation to complete?

In a simple in-order processor, instructions are executed exactly in the order they appear in the program. This creates a significant performance bottleneck because:

Long-latency operations (like memory accesses) block subsequent instructions
Independent instructions cannot execute in parallel
CPU resources remain idle while waiting for slow operations

For example, consider this code:

cpp

int a = load_from_memory();  // 100 cycles
int b = a + 1;              // 1 cycle (waits for a)
int c = 5 * 2;              // 1 cycle (independent, but waits)
int d = b + c;              // 1 cycle (waits for b and c)

In an in-order processor, this takes 103 cycles because each instruction waits for the previous one.

The Solution: Out-of-Order Execution

Out-of-order execution allows the CPU to execute instructions in a different order than they appear in the program, as long as dependencies are respected. This enables:

Parallel execution of independent instructions
Better resource utilization by keeping execution units busy
Hiding latency of slow operations

The same code in an out-of-order processor might execute like this:

cpp

int a = load_from_memory();  // 100 cycles (starts first)
int c = 5 * 2;              // 1 cycle (executes in parallel with load)
int b = a + 1;              // 1 cycle (executes when a is ready)
int d = b + c;              // 1 cycle (executes when b and c are ready)

Total time: ~101 cycles instead of 103 cycles.

How Out-of-Order Execution Works

The Tomasulo Algorithm

Modern out-of-order processors use variations of the Tomasulo algorithm, which consists of several key components:

1. Register Renaming

Register renaming eliminates false dependencies by dynamically mapping architectural registers (visible to the program) to physical registers (actual hardware registers).

cpp

// Original code with false dependency
r1 = load(x);     // r1 = architectural register
r1 = r1 + 1;      // This appears to depend on the load
r2 = r1 * 2;      // This depends on the add

// After register renaming
p1 = load(x);     // p1 = physical register
p2 = p1 + 1;      // p2 = different physical register
p3 = p2 * 2;      // p3 = different physical register

2. Reservation Stations

Reservation stations hold instructions that are ready to execute (all operands available) and dispatch them to execution units when the units become available.

cpp

Reservation Station 1: [ADD p2 = p1 + 1] (waiting for p1)
Reservation Station 2: [MUL p3 = p2 * 2] (waiting for p2)
Reservation Station 3: [LOAD p4 = y] (ready to execute)

3. Reorder Buffer (ROB)

The reorder buffer ensures that instructions are retired (made visible to the program) in the same order they appeared in the original program, even though they may have executed out of order.

cpp

ROB Entry 1: [LOAD p1 = x] (completed, ready to retire)
ROB Entry 2: [ADD p2 = p1 + 1] (completed, waiting for entry 1)
ROB Entry 3: [MUL p3 = p2 * 2] (completed, waiting for entry 2)

Execution Flow

Fetch: Instructions are fetched from memory
Decode: Instructions are decoded and dependencies identified
Rename: Architectural registers are renamed to physical registers
Dispatch: Instructions are sent to reservation stations
Execute: Instructions execute when operands are available
Writeback: Results are written to physical registers
Retire: Instructions are retired in program order

Data Dependencies

Types of Dependencies

1. Data Dependencies (True Dependencies)

When one instruction depends on the result of another instruction:

cpp

int a = load(x);  // Instruction A
int b = a + 1;    // Instruction B depends on A

2. Anti-Dependencies

When one instruction writes to a register that a later instruction reads:

cpp

int a = 5;        // Instruction A writes to a
int b = a + 1;    // Instruction B reads from a
int a = 10;       // Instruction C writes to a

3. Output Dependencies

When multiple instructions write to the same register:

cpp

int a = 5;        // Instruction A writes to a
int a = 10;       // Instruction B writes to a

Dependency Resolution

cpp

// Original code with dependencies
int a = 5;        // p1 = 5
int b = a + 1;    // p2 = p1 + 1
int a = 10;       // p3 = 10 (different physical register)

// After renaming, no dependencies between the add and the second assignment

Memory Ordering

The Memory Ordering Problem

In multi-threaded programs, instruction reordering can cause memory ordering problems:

cpp

// Thread 1
x = 1;            // Store to x
ready = 1;        // Store to ready

// Thread 2
while (!ready);   // Load from ready
print(x);         // Load from x

Without proper memory ordering, Thread 2 might see ready = 1 but x = 0, which violates the expected ordering.

Memory Barriers

Memory barriers are instructions that prevent certain types of instruction reordering:

1. Full Memory Barrier

Prevents all reordering across the barrier:

cpp

x = 1;
memory_barrier();  // All stores before barrier complete before loads after
y = load(z);

2. Load-Load Barrier

Prevents reordering of loads:

cpp

int a = load(x);
load_load_barrier();
int b = load(y);  // Cannot be reordered before load(x)

3. Store-Store Barrier

Prevents reordering of stores:

cpp

x = 1;
store_store_barrier();
y = 2;  // Cannot be reordered before x = 1

4. Load-Store Barrier

Prevents reordering of loads and stores:

cpp

int a = load(x);
load_store_barrier();
y = 2;  // Cannot be reordered before load(x)

Acquire Semantics

Prevents reordering of loads before the acquire operation:

cpp

int a = load(x);
acquire_barrier();
int b = load(y);  // Cannot be reordered before load(x)

Release Semantics

Prevents reordering of stores after the release operation:

cpp

x = 1;
release_barrier();
y = 2;  // Cannot be reordered after x = 1

Speculative Execution

What is Speculative Execution?

Speculative execution is executing instructions that may not be needed, hoping they will be useful. If the speculation is wrong, the results are discarded.

Branch Prediction and Speculation

cpp

if (condition) {
    x = expensive_calculation();
} else {
    x = simple_calculation();
}

The CPU might:

Predict the branch outcome
Speculatively execute the predicted path
Discard results if prediction was wrong

Benefits

Hides branch latency: Continues execution while waiting for branch resolution
Better resource utilization: Keeps execution units busy
Improved performance: Reduces pipeline stalls

Risks

Wasted work: Speculative execution may be discarded
Security vulnerabilities: Spectre and Meltdown attacks
Power consumption: Executing unnecessary instructions

Performance Implications

When Out-of-Order Helps

Out-of-order execution is most beneficial when:

Independent instructions are available
Long-latency operations (memory accesses, divisions) are present
Multiple execution units are available
Register pressure is low

When Out-of-Order Doesn't Help

Out-of-order execution provides little benefit when:

All instructions are dependent on each other
Memory bandwidth is the bottleneck
Cache misses are frequent
Branch mispredictions are common

Real-World Impact

In high-frequency trading and systems programming:

Latency sensitivity: Every cycle matters
Predictable performance: Understanding reordering helps optimize
Memory ordering: Critical for correct multi-threaded code

Hardware Implementation

Modern CPU Architecture

cpp

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Fetch Unit    │───▶│  Decode/Rename  │───▶│ Reservation     │
│                 │    │                 │    │ Stations        │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                       │
                                                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Reorder       │◀───│  Execution      │◀───│  Physical       │
│   Buffer        │    │  Units          │    │  Registers      │
└─────────────────┘    └─────────────────┘    └─────────────────┘

1. Fetch Unit

Fetches instructions from instruction cache
Handles branch prediction
Maintains instruction pointer

2. Decode/Rename Unit

Decodes instructions
Identifies dependencies
Performs register renaming

3. Reservation Stations

Hold ready instructions
Dispatch to execution units
Track operand availability

4. Execution Units

ALU (arithmetic/logic operations)
FPU (floating-point operations)
LSU (load/store unit)
Branch unit

5. Reorder Buffer

Maintains program order
Handles instruction retirement
Manages exceptions

Programming Considerations

Writing Out-of-Order Friendly Code

1. Minimize Dependencies

cpp

// Bad: Many dependencies
int result = 0;
for (int i = 0; i < N; i++) {
    result += array[i];  // Each iteration depends on previous
}

// Good: Fewer dependencies
int sum1 = 0, sum2 = 0;
for (int i = 0; i < N; i += 2) {
    sum1 += array[i];
    sum2 += array[i + 1];  // Independent operations
}
int result = sum1 + sum2;

2. Use Memory Ordering Correctly

cpp

// Correct use of memory barriers
std::atomic<int> flag{0};
int data = 0;

// Thread 1
data = 42;
flag.store(1, std::memory_order_release);

// Thread 2
while (flag.load(std::memory_order_acquire) == 0);
assert(data == 42);  // Guaranteed to be true

Debugging Out-of-Order Issues

1. Memory Ordering Bugs

Symptoms: Inconsistent behavior in multi-threaded code
Tools: ThreadSanitizer, Valgrind Helgrind
Solution: Use proper memory barriers and atomic operations

2. Performance Issues

Symptoms: Poor performance despite good algorithms
Tools: Performance profilers, hardware counters
Solution: Analyze instruction-level parallelism and dependencies

Questions

Q: What is out-of-order execution?

Out-of-order execution is a technique where the CPU executes instructions in a different order than they appear in the program to improve performance. The CPU can execute independent instructions in parallel while waiting for slower operations to complete.

Q: Why do modern CPUs use out-of-order execution?

Modern CPUs use out-of-order execution to make programs run faster by executing independent instructions in parallel. This allows the CPU to continue working while waiting for slow operations like memory accesses to complete.

Q: What is a data dependency?

A data dependency occurs when one instruction depends on the result of another instruction. For example, if instruction B needs the result of instruction A, then B cannot execute until A completes.

Q: What is the reorder buffer (ROB)?

The reorder buffer (ROB) stores completed instructions in program order for retirement. It ensures that instructions are retired (made visible to the program) in the same order they appeared in the original program, even though they may have executed out of order.

Q: What is instruction retirement?

Instruction retirement is when an instruction's results are made visible to the program in program order. Even though instructions may execute out of order, they must retire in program order to maintain the illusion of sequential execution.

Q: What is a memory barrier?

A memory barrier is an instruction that prevents certain types of instruction reordering. It ensures that memory operations before the barrier complete before memory operations after the barrier begin.

Q: What is the difference between acquire and release semantics?

Acquire semantics prevent reordering of loads before the acquire operation, while release semantics prevent reordering of stores after the release operation. This is important for implementing synchronization primitives like mutexes.

Q: What is speculative execution?

Speculative execution is executing instructions that may not be needed, hoping they will be useful. If the speculation is wrong, the results are discarded. This is commonly used with branch prediction.

Q: What is the purpose of the reservation station?

The reservation station stores instructions waiting to be executed. It holds instructions that are ready to execute (all their operands are available) and dispatches them to execution units when the units become available.

Q: What is register renaming?

Register renaming is dynamically mapping architectural registers (those visible to the program) to physical registers (actual hardware registers) to eliminate false dependencies. This allows more instructions to execute in parallel.

Instruction Reordering and Out-of-Order Execution ​

The Solution: Out-of-Order Execution ​

How Out-of-Order Execution Works ​

The Tomasulo Algorithm ​

1. Register Renaming ​

2. Reservation Stations ​

3. Reorder Buffer (ROB) ​

Execution Flow ​

Data Dependencies ​

Types of Dependencies ​

1. Data Dependencies (True Dependencies) ​

2. Anti-Dependencies ​

3. Output Dependencies ​

Dependency Resolution ​

Memory Ordering ​

The Memory Ordering Problem ​

Memory Barriers ​

1. Full Memory Barrier ​

2. Load-Load Barrier ​

3. Store-Store Barrier ​

4. Load-Store Barrier ​

Acquire Semantics ​

Release Semantics ​

Speculative Execution ​

What is Speculative Execution? ​

Branch Prediction and Speculation ​

Benefits ​

Risks ​

Performance Implications ​

When Out-of-Order Helps ​

When Out-of-Order Doesn't Help ​

Real-World Impact ​

Hardware Implementation ​

Modern CPU Architecture ​

1. Fetch Unit ​

2. Decode/Rename Unit ​

3. Reservation Stations ​

4. Execution Units ​

5. Reorder Buffer ​

Programming Considerations ​

Writing Out-of-Order Friendly Code ​

1. Minimize Dependencies ​

2. Use Memory Ordering Correctly ​

Debugging Out-of-Order Issues ​

1. Memory Ordering Bugs ​

2. Performance Issues ​

Questions ​