Branch Prediction

Modern processors use pipelines to achieve high performance. When a branch instruction (like if, while, for) is encountered, the processor must decide which instruction to fetch next before the branch condition is evaluated. This creates a control hazard - the processor must guess the branch outcome to keep the pipeline full.

cpp
// Example of a branch that creates a control hazard
if (x > 0) {
    // Branch taken path
    result = x * 2;
} else {
    // Branch not-taken path  
    result = x / 2;
}
// Next instruction after the if-else block

Without branch prediction, the pipeline does not know which instruction to fetch next, so it must stall until the condition is evaluated, significantly reducing performance. Branch prediction solves this by guessing which path the branch will take and continuing execution speculatively.

In practice, the processor fetches the instruction it expects to execute next and puts it in the pipeline. When the result of the branch becomes available, the processor either continues (if its prediction was correct) or, on a misprediction, flushes the pipeline and fetches the correct instruction. A flush is a very expensive operation, which is why accurate branch prediction is so important.

While the processor is clearing the wrong-path instructions and their side effects and fetching the correct instructions, the pipeline is effectively stalled - no useful work completes.

Performance Impact

Branch Misprediction Penalty

When a branch prediction is wrong, the CPU must:

  1. Flush the pipeline: Discard all instructions from the wrong path
  2. Restart execution: Begin fetching instructions from the correct path
  3. Recover state: Restore any registers that were modified

This misprediction penalty typically costs 10-20 CPU cycles, which is why accurate branch prediction is crucial for performance.

Real-World Impact

Consider this simple loop:

cpp
for (int i = 0; i < 1000000; i++) {
    if (i % 2 == 0) {
        // Even number processing
    } else {
        // Odd number processing
    }
}

Without branch prediction, the pipeline stalls at every iteration, dramatically reducing performance. With branch prediction, the CPU predicts the branch outcome and the pipeline continues smoothly.

Branch Prediction Strategies

The CPU keeps a table called the branch history table (BHT), which records the last N branches and whether each was taken or not taken. This table is used to predict the outcome of the next branch. This section builds intuition for the different strategies that can be employed to make that prediction.

Static Prediction

  • Always predict not taken: Assume branches are not taken
  • Always predict taken: Assume branches are taken
  • Backward taken, forward not taken: Predict backward branches (loops) as taken, forward branches as not taken

It is easy to see that these are not great strategies: the chance of misprediction is high, and they do not take the last few outcomes of the branch into account.

Dynamic Prediction

  • Branch History Table (BHT): Records the outcome of recent branches
  • Pattern-based prediction: Uses patterns in branch behavior
  • Correlated prediction: Considers the outcome of other branches

Local Branch Prediction

A local branch predictor only considers the history of the current branch instruction. It maintains a separate prediction table for each branch, using the branch's own past behavior to predict its future outcome.

cpp
Branch Address → Prediction Table → Prediction
     ↓              ↓                ↓
  0x1234    →   [History]    →   Taken/Not Taken

Each branch has its own entry in the prediction table, containing:

  • Branch history: Recent outcomes of this specific branch
  • Prediction state: Current prediction for this branch

Consider a branch that checks if a number is positive:

cpp
if (x > 0) {
    // Process positive number
}

If this branch has been taken 8 times in a row, the local predictor will predict "taken" for the next occurrence, regardless of what other branches in the program are doing.

  • Advantages: Simple and fast, good for predictable branches, low hardware cost
  • Disadvantages: Cannot capture correlations between branches, limited history, poor for complex patterns

Global Branch Prediction

A global branch predictor considers the history of all branches in the program, not just the current branch. It uses a Global History Register (GHR) to track the outcomes of recent branches.

cpp
Global History Register → Prediction Table → Prediction
         ↓                    ↓                ↓
   [T,N,T,T,N]    →      [Pattern]    →   Taken/Not Taken

The GHR stores a pattern of recent branch outcomes (T = Taken, N = Not Taken), and the prediction table uses this pattern to make predictions.

Consider this code:

cpp
if (x > 0) {        // Branch A
    if (y > 0) {    // Branch B
        // Both positive
    }
}

If Branch A is taken, then Branch B is likely to be taken as well. A global predictor can capture this correlation:

  • GHR pattern: [A_taken, B_taken, A_taken, B_taken, ...]
  • Prediction: When A is taken, predict B will be taken

  • Advantages: Captures correlations, better for complex patterns, more accurate
  • Disadvantages: More complex, higher latency, prone to history aliasing

2-Bit Saturating Counter

A 2-bit saturating counter is a fundamental component used in branch predictors. It's a 2-bit counter that can count from 00 to 11 and "saturates" at its maximum value (doesn't wrap around).

The Four States

  State              | Binary | Meaning                                     | Prediction
  Strongly Not Taken | 00     | Very confident branch will not be taken     | Predict Not Taken
  Weakly Not Taken   | 01     | Somewhat confident branch will not be taken | Predict Not Taken
  Weakly Taken       | 10     | Somewhat confident branch will be taken     | Predict Taken
  Strongly Taken     | 11     | Very confident branch will be taken         | Predict Taken

Think of it as a state machine with four states. The state is determined by the branch's recent outcomes: the counter is incremented each time the branch is taken and decremented each time it is not taken.

How the Counter Works

The counter changes state based on the actual branch outcome:

cpp
Strongly Not Taken (00) ←→ Weakly Not Taken (01) ←→ Weakly Taken (10) ←→ Strongly Taken (11)
        ↑                                                                        ↑
    Branch Not Taken                                                    Branch Taken

Rules:

  • If branch is taken: Increment counter (unless already at 11)
  • If branch is not taken: Decrement counter (unless already at 00)

Example State Transitions

Consider a branch that is usually taken but occasionally not taken:

cpp
Initial: Strongly Taken (11)
Branch Taken → Strongly Taken (11) [stays at max]
Branch Taken → Strongly Taken (11) [stays at max]
Branch Not Taken → Weakly Taken (10) [decrements]
Branch Taken → Strongly Taken (11) [increments]
Branch Not Taken → Weakly Taken (10) [decrements]

Saturating counters are used because they provide hysteresis (require multiple wrong predictions to change state), stability (prevent oscillation), and confidence (higher states indicate higher confidence).

Modern Branch Predictors

Hybrid Predictors

Modern CPUs use sophisticated hybrid predictors that combine:

  • Local predictors: For branches with consistent patterns
  • Global predictors: For branches with correlations
  • Loop predictors: Specialized for loop branches
  • Return stack buffers: For function return addresses

Prediction Accuracy

Typical branch prediction accuracy:

  • Simple predictors: 70-80% accuracy
  • Modern predictors: 90-95% accuracy
  • Loop branches: 95-99% accuracy (very predictable)

Optimization Techniques

Data Layout Optimization

Sort data to make branches more predictable:

cpp
// Before: Unpredictable branches
std::vector<int> unsorted = {5, 1, 8, 2, 9, 3, 7, 4, 6};
int result = sum_above_threshold(unsorted, 5);

// After: Predictable branches
std::sort(unsorted.begin(), unsorted.end());
int result = sum_above_threshold(unsorted, 5);

Branchless Programming

Replace branches with arithmetic operations:

cpp
// Branchy version
int max_branchy(int a, int b) {
    if (a > b) return a;
    else return b;
}

// Branchless version
int max_branchless(int a, int b) {
    return a ^ ((a ^ b) & -(a < b));
}

// Alternative using conditional move
int max_cmov(int a, int b) {
    return (a > b) ? a : b; // Compiler may generate CMOV
}

The compiler may actually optimize the branchy version into branchless code automatically, for example by emitting a conditional move (CMOV) instead of a branch. This is a good example of how the compiler can help you write better code.

Loop Unrolling

Reduce the number of branches in loops:

cpp
// Before: One branch per iteration
for (int i = 0; i < n; i++) {
    if (data[i] > threshold) sum += data[i];
}

// After: one loop branch per 4 iterations
// (assumes n is a multiple of 4; a real version handles the remainder)
for (int i = 0; i < n; i += 4) {
    if (data[i]   > threshold) sum += data[i];
    if (data[i+1] > threshold) sum += data[i+1];
    if (data[i+2] > threshold) sum += data[i+2];
    if (data[i+3] > threshold) sum += data[i+3];
}

Similarly, the compiler may unroll such loops automatically!

Writing Branch-Friendly Code

cpp
// Good: Predictable branches
for (int i = 0; i < N; i++) {
    if (i < N/2) {  // Predictable pattern
        processA();
    } else {
        processB();
    }
}

// Bad: Unpredictable branches
for (int i = 0; i < N; i++) {
    if (random() % 2 == 0) {  // Random pattern
        processA();
    } else {
        processB();
    }
}

Conditional Elimination

cpp
// Bad: Multiple branches
if (x > 0) {
    if (y > 0) {
        result = x + y;
    } else {
        result = x - y;
    }
} else {
    if (y > 0) {
        result = -x + y;
    } else {
        result = -x - y;
    }
}

// Good: Fewer branches
result = (x > 0 ? x : -x) + (y > 0 ? y : -y);

Modern CPU Implementations

Intel Branch Predictors

Intel CPUs use sophisticated hybrid predictors:

  • Local predictors: 1024-entry pattern history tables
  • Global predictors: 4096-entry global history tables
  • Loop predictors: Specialized for loop branches
  • Indirect predictors: For function pointers and virtual calls

AMD Branch Predictors

AMD CPUs use similar hybrid approaches:

  • TAGE predictor: Tagged Geometric History Length
  • Loop predictors: Optimized for loop branches
  • Return stack buffers: For function returns

ARM Branch Predictors

ARM processors use:

  • Global history predictors: For general branches
  • Loop predictors: For loop optimization
  • Return address predictors: For function calls

Practical Guidelines

When to Optimize

  • Hot paths: Focus on frequently executed code
  • Predictable patterns: Look for data-dependent branches
  • Performance-critical sections: Trading algorithms, real-time systems

When Not to Optimize

  • Cold paths: Rarely executed code
  • Unpredictable data: Random or adversarial inputs
  • Readability trade-offs: When optimization hurts maintainability

Measurement

Always measure before and after optimization:

cpp
#include <chrono>
#include <iostream>

auto start = std::chrono::high_resolution_clock::now();
// ... code to measure ...
auto end = std::chrono::high_resolution_clock::now();

auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
std::cout << "Execution time: " << duration.count() << " microseconds\n";

Questions

Q: What is the main purpose of branch prediction?

Branch prediction predicts the outcome of conditional branches (if/else, loops) to maintain instruction pipeline flow. Without prediction, the pipeline would stall every time it encounters a branch, significantly reducing performance.

Q: What is a local branch predictor?

A local branch predictor only considers the history of the current branch instruction. It maintains a separate prediction table for each branch, using the branch's own past behavior to predict its future outcome.

Q: What is a global branch predictor?

A global branch predictor considers the history of all branches in the program, not just the current branch. It uses a global history register to track the outcomes of recent branches and uses this information to predict the current branch.

Q: What is a 2-bit saturating counter?

A 2-bit saturating counter is a 2-bit counter that can count from 00 to 11 and saturates at its maximum value. It's commonly used in branch predictors to track branch history and make predictions based on the counter state.

Q: What are the four states of a 2-bit saturating counter?

The four states of a 2-bit saturating counter are: Strongly Not Taken (00), Weakly Not Taken (01), Weakly Taken (10), and Strongly Taken (11). These states represent the confidence level in predicting whether a branch will be taken or not.

Q: What happens when a branch prediction is correct?

When a branch prediction is correct, the pipeline continues smoothly without any stalls. The CPU keeps executing instructions from the predicted path, maintaining high instruction throughput and performance.

Q: What is a branch misprediction penalty?

A branch misprediction penalty is the performance cost when a branch prediction is wrong. The CPU must flush the pipeline, discard incorrectly fetched instructions, and restart execution from the correct path, which can cost 10-20 cycles.

Q: What is a global history register (GHR)?

A global history register (GHR) stores the history of recent branch outcomes (taken/not taken) from all branches in the program. This information is used by global branch predictors to make predictions about the current branch.

Q: What is the advantage of a global branch predictor over a local predictor?

A global branch predictor can capture correlations between different branches. For example, if branch A is taken, then branch B is likely to be taken. Local predictors cannot capture these correlations since they only consider individual branch history.

Q: What is speculative execution?

Speculative execution is executing instructions before knowing if they should be executed. The CPU executes instructions from the predicted path of a branch, hoping the prediction is correct. If wrong, the results are discarded.