Instruction Reordering and Out-of-Order Execution

Not every CPU instruction takes the same number of cycles. Multiplication and especially division (integer or floating point) cost far more cycles than simple operations like lshift or rshift. Furthermore, some operations must access memory and can stall all subsequent instructions. Instruction reordering addresses exactly this dilemma: what if we could process independent instructions while waiting for a more expensive operation to complete?

In a simple in-order processor, instructions are executed exactly in the order they appear in the program. This creates a significant performance bottleneck because:

  • Long-latency operations (like memory accesses) block subsequent instructions
  • Independent instructions cannot execute in parallel
  • CPU resources remain idle while waiting for slow operations

For example, consider this code:

cpp
int a = load_from_memory();  // 100 cycles
int b = a + 1;              // 1 cycle (waits for a)
int c = 5 * 2;              // 1 cycle (independent, but waits)
int d = b + c;              // 1 cycle (waits for b and c)

In an in-order processor, this takes 103 cycles because each instruction waits for the previous one.

The Solution: Out-of-Order Execution

Out-of-order execution allows the CPU to execute instructions in a different order than they appear in the program, as long as dependencies are respected. This enables:

  • Parallel execution of independent instructions
  • Better resource utilization by keeping execution units busy
  • Hiding latency of slow operations

The same code in an out-of-order processor might execute like this:

cpp
int a = load_from_memory();  // 100 cycles (starts first)
int c = 5 * 2;              // 1 cycle (executes in parallel with load)
int b = a + 1;              // 1 cycle (executes when a is ready)
int d = b + c;              // 1 cycle (executes when b and c are ready)

Total time: ~102 cycles instead of 103, because the independent multiply executes entirely under the shadow of the load.

How Out-of-Order Execution Works

The Tomasulo Algorithm

Modern out-of-order processors use variations of the Tomasulo algorithm, which consists of several key components:

1. Register Renaming

Register renaming eliminates false dependencies by dynamically mapping architectural registers (visible to the program) to physical registers (actual hardware registers).

cpp
// Original code with a false dependency
r1 = load(x);     // long-latency load into r1
r2 = r1 + 1;      // true dependency: must wait for the load
r1 = 5;           // reuses r1: write-after-read, a false dependency
r3 = r1 * 2;      // depends only on the new value of r1

// After register renaming
p1 = load(x);     // p1 = physical register
p2 = p1 + 1;      // still waits for p1 (true dependencies remain)
p3 = 5;           // fresh physical register: no conflict with p1
p4 = p3 * 2;      // can execute immediately, without waiting for the load

2. Reservation Stations

Reservation stations hold instructions while they wait for their operands; once all operands are available and a suitable execution unit is free, the instruction is dispatched to that unit.

cpp
Reservation Station 1: [ADD p2 = p1 + 1] (waiting for p1)
Reservation Station 2: [MUL p3 = p2 * 2] (waiting for p2)
Reservation Station 3: [LOAD p4 = y] (ready to execute)

3. Reorder Buffer (ROB)

The reorder buffer ensures that instructions are retired (made visible to the program) in the same order they appeared in the original program, even though they may have executed out of order.

cpp
ROB Entry 1: [LOAD p1 = x] (completed, ready to retire)
ROB Entry 2: [ADD p2 = p1 + 1] (completed, waiting for entry 1)
ROB Entry 3: [MUL p3 = p2 * 2] (completed, waiting for entry 2)

Execution Flow

  1. Fetch: Instructions are fetched from memory
  2. Decode: Instructions are decoded and dependencies identified
  3. Rename: Architectural registers are renamed to physical registers
  4. Dispatch: Instructions are sent to reservation stations
  5. Execute: Instructions execute when operands are available
  6. Writeback: Results are written to physical registers
  7. Retire: Instructions are retired in program order

Data Dependencies

Types of Dependencies

1. Data Dependencies (True Dependencies)

When one instruction depends on the result of another instruction (read-after-write):

cpp
int a = load(x);  // Instruction A
int b = a + 1;    // Instruction B depends on A

2. Anti-Dependencies

When one instruction writes to a register that a later instruction reads:

cpp
int a = 5;        // Instruction A writes to a
int b = a + 1;    // Instruction B reads from a
int a = 10;       // Instruction C writes to a

3. Output Dependencies

When multiple instructions write to the same register (write-after-write):

cpp
int a = 5;        // Instruction A writes to a
int a = 10;       // Instruction B writes to a

Dependency Resolution

Register renaming eliminates anti-dependencies and output dependencies by using different physical registers:

cpp
// Original code with dependencies
int a = 5;        // p1 = 5
int b = a + 1;    // p2 = p1 + 1
int a = 10;       // p3 = 10 (different physical register)

// After renaming, no dependencies between the add and the second assignment

Memory Ordering

The Memory Ordering Problem

In multi-threaded programs, instruction reordering can cause memory ordering problems:

cpp
// Thread 1
x = 1;            // Store to x
ready = 1;        // Store to ready

// Thread 2
while (!ready);   // Load from ready
print(x);         // Load from x

Without proper memory ordering, Thread 2 might see ready = 1 but x = 0, which violates the expected ordering.

Memory Barriers

Memory barriers are instructions that prevent certain types of instruction reordering:

1. Full Memory Barrier

Prevents all reordering across the barrier:

cpp
x = 1;
memory_barrier();  // No loads or stores may be reordered across this point
y = load(z);

2. Load-Load Barrier

Prevents reordering of loads:

cpp
int a = load(x);
load_load_barrier();
int b = load(y);  // Cannot be reordered before load(x)

3. Store-Store Barrier

Prevents reordering of stores:

cpp
x = 1;
store_store_barrier();
y = 2;  // Cannot be reordered before x = 1

4. Load-Store Barrier

Prevents reordering of loads and stores:

cpp
int a = load(x);
load_store_barrier();
y = 2;  // Cannot be reordered before load(x)

Acquire Semantics

Prevents reordering of loads before the acquire operation:

cpp
int a = load(x);
acquire_barrier();
int b = load(y);  // Cannot be reordered before load(x)

Release Semantics

Prevents reordering of stores after the release operation:

cpp
x = 1;
release_barrier();
y = 2;  // Cannot be reordered after x = 1

Speculative Execution

What is Speculative Execution?

Speculative execution means executing instructions that may not be needed, in the hope that they will turn out to be useful. If the speculation is wrong, the results are discarded.

Branch Prediction and Speculation

cpp
if (condition) {
    x = expensive_calculation();
} else {
    x = simple_calculation();
}

The CPU might:

  1. Predict the branch outcome
  2. Speculatively execute the predicted path
  3. Discard results if prediction was wrong

Benefits

  • Hides branch latency: Continues execution while waiting for branch resolution
  • Better resource utilization: Keeps execution units busy
  • Improved performance: Reduces pipeline stalls

Risks

  • Wasted work: Speculative execution may be discarded
  • Security vulnerabilities: Spectre and Meltdown attacks
  • Power consumption: Executing unnecessary instructions

Performance Implications

When Out-of-Order Helps

Out-of-order execution is most beneficial when:

  • Independent instructions are available
  • Long-latency operations (memory accesses, divisions) are present
  • Multiple execution units are available
  • Register pressure is low

When Out-of-Order Doesn't Help

Out-of-order execution provides little benefit when:

  • All instructions are dependent on each other
  • Memory bandwidth is the bottleneck
  • Cache misses are frequent
  • Branch mispredictions are common

Real-World Impact

In high-frequency trading and systems programming:

  • Latency sensitivity: Every cycle matters
  • Predictable performance: Understanding reordering helps optimize
  • Memory ordering: Critical for correct multi-threaded code

Hardware Implementation

Modern CPU Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Fetch Unit    │───▶│  Decode/Rename  │───▶│ Reservation     │
│                 │    │                 │    │ Stations        │
└─────────────────┘    └─────────────────┘    └─────────────────┘


┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Reorder       │◀───│  Execution      │◀───│  Physical       │
│   Buffer        │    │  Units          │    │  Registers      │
└─────────────────┘    └─────────────────┘    └─────────────────┘

1. Fetch Unit

  • Fetches instructions from instruction cache
  • Handles branch prediction
  • Maintains instruction pointer

2. Decode/Rename Unit

  • Decodes instructions
  • Identifies dependencies
  • Performs register renaming

3. Reservation Stations

  • Hold ready instructions
  • Dispatch to execution units
  • Track operand availability

4. Execution Units

  • ALU (arithmetic/logic operations)
  • FPU (floating-point operations)
  • LSU (load/store unit)
  • Branch unit

5. Reorder Buffer

  • Maintains program order
  • Handles instruction retirement
  • Manages exceptions

Programming Considerations

Writing Out-of-Order Friendly Code

1. Minimize Dependencies

cpp
// Bad: Many dependencies
int result = 0;
for (int i = 0; i < N; i++) {
    result += array[i];  // Each iteration depends on previous
}

// Good: Fewer dependencies (two independent chains; assumes N is even)
int sum1 = 0, sum2 = 0;
for (int i = 0; i < N; i += 2) {
    sum1 += array[i];
    sum2 += array[i + 1];  // Independent operations
}
int result = sum1 + sum2;

2. Use Memory Ordering Correctly

cpp
// Correct use of memory barriers
std::atomic<int> flag{0};
int data = 0;

// Thread 1
data = 42;
flag.store(1, std::memory_order_release);

// Thread 2
while (flag.load(std::memory_order_acquire) == 0);
assert(data == 42);  // Guaranteed to be true

Debugging Out-of-Order Issues

1. Memory Ordering Bugs

  • Symptoms: Inconsistent behavior in multi-threaded code
  • Tools: ThreadSanitizer, Valgrind Helgrind
  • Solution: Use proper memory barriers and atomic operations

2. Performance Issues

  • Symptoms: Poor performance despite good algorithms
  • Tools: Performance profilers, hardware counters
  • Solution: Analyze instruction-level parallelism and dependencies

Questions

Q: What is out-of-order execution?

Out-of-order execution is a technique where the CPU executes instructions in a different order than they appear in the program to improve performance. The CPU can execute independent instructions in parallel while waiting for slower operations to complete.

Q: Why do modern CPUs use out-of-order execution?

Modern CPUs use out-of-order execution to make programs run faster by executing independent instructions in parallel. This allows the CPU to continue working while waiting for slow operations like memory accesses to complete.

Q: What is a data dependency?

A data dependency occurs when one instruction depends on the result of another instruction. For example, if instruction B needs the result of instruction A, then B cannot execute until A completes.

Q: What is the reorder buffer (ROB)?

The reorder buffer (ROB) stores completed instructions in program order for retirement. It ensures that instructions are retired (made visible to the program) in the same order they appeared in the original program, even though they may have executed out of order.

Q: What is instruction retirement?

Instruction retirement is when an instruction's results are made visible to the program in program order. Even though instructions may execute out of order, they must retire in program order to maintain the illusion of sequential execution.

Q: What is a memory barrier?

A memory barrier is an instruction that prevents certain types of instruction reordering. It ensures that memory operations before the barrier complete before memory operations after the barrier begin.

Q: What is the difference between acquire and release semantics?

Acquire semantics prevent reordering of loads before the acquire operation, while release semantics prevent reordering of stores after the release operation. This is important for implementing synchronization primitives like mutexes.

Q: What is speculative execution?

Speculative execution is executing instructions that may not be needed, hoping they will be useful. If the speculation is wrong, the results are discarded. This is commonly used with branch prediction.

Q: What is the purpose of the reservation station?

The reservation station stores instructions waiting to be executed. It holds instructions that are ready to execute (all their operands are available) and dispatches them to execution units when the units become available.

Q: What is register renaming?

Register renaming is dynamically mapping architectural registers (those visible to the program) to physical registers (actual hardware registers) to eliminate false dependencies. This allows more instructions to execute in parallel.