Pipelined Processors and Instruction Execution
Video: CPU Pipeline - Computerphile
A pipelined processor divides instruction execution into multiple stages, allowing different instructions to be processed simultaneously. This creates an assembly line effect where multiple instructions are in different stages of execution at the same time.
Pipeline Benefits
- Increased Throughput: Multiple instructions in flight simultaneously
- Better Resource Utilization: Different pipeline stages use different CPU resources
- Higher Clock Frequencies: Shorter stages allow faster clock rates
- Improved Performance: Instructions per cycle (IPC) increases
Classic 5-Stage Pipeline
The classic RISC pipeline consists of five stages:
```
Instruction Fetch → Instruction Decode → Execute → Memory Access → Write Back
      (IF)               (ID)             (EX)        (MEM)          (WB)
```

1. Instruction Fetch (IF)
```cpp
// Instruction fetch stage
struct IF_Stage {
    uint64_t pc;           // Program counter
    uint32_t instruction;  // Fetched instruction
    bool valid;            // Is instruction valid?

    void fetch() {
        instruction = memory_read(pc);  // Read from instruction memory
        pc += 4;                        // Increment program counter
        valid = true;
    }
};
```

What happens:
- Read the instruction from memory at the current PC (program counter: the register that holds the address of the next instruction to execute)
- Increment program counter
- Handle instruction cache misses
2. Instruction Decode (ID)
```cpp
// Instruction decode stage (MIPS-style R/I-type instruction fields)
struct ID_Stage {
    uint32_t instruction;
    uint8_t opcode;
    uint8_t rs, rt, rd;    // Register specifiers
    uint16_t immediate;
    uint32_t funct;        // Function field

    void decode() {
        opcode    = (instruction >> 26) & 0x3F;
        rs        = (instruction >> 21) & 0x1F;
        rt        = (instruction >> 16) & 0x1F;
        rd        = (instruction >> 11) & 0x1F;
        immediate = instruction & 0xFFFF;
        funct     = instruction & 0x3F;
    }
};
```

What happens:
- Parse instruction fields (opcode, registers, immediate)
- Read register values
- Determine instruction type and control signals
3. Execute (EX)
```cpp
// Execute stage
struct EX_Stage {
    uint64_t alu_result;
    uint64_t operand1, operand2;
    uint8_t alu_op;

    void execute() {
        switch (alu_op) {
            case ALU_ADD:
                alu_result = operand1 + operand2;
                break;
            case ALU_SUB:
                alu_result = operand1 - operand2;
                break;
            case ALU_AND:
                alu_result = operand1 & operand2;
                break;
            case ALU_OR:
                alu_result = operand1 | operand2;
                break;
            // ... other operations
        }
    }
};
```

What happens:
- Perform arithmetic/logical operations
- Calculate branch targets
- Handle ALU operations
4. Memory Access (MEM)
```cpp
// Memory access stage
struct MEM_Stage {
    uint64_t address;
    uint64_t data;
    bool mem_read, mem_write;

    void memory_access() {
        if (mem_read) {
            data = memory_read(address);
        }
        if (mem_write) {
            memory_write(address, data);
        }
    }
};
```

What happens:
- Access data memory (load/store instructions)
- Handle data cache misses
- Use the effective address calculated in the EX stage
5. Write Back (WB)
```cpp
// Write back stage (registers[] is the architectural register file,
// defined elsewhere)
struct WB_Stage {
    uint64_t result;
    uint8_t write_reg;
    bool reg_write;

    void write_back() {
        if (reg_write) {
            registers[write_reg] = result;
        }
    }
};
```

What happens:
- Write results back to register file
- Complete instruction execution (branch outcomes are resolved earlier, in the EX/MEM stages, not here)
Instruction Parsing and Decoding
Instruction Format
Modern x86_64 instructions use a complex variable-length format:
```cpp
// x86_64 instruction format
struct x86_instruction {
    // Prefixes (0-4 bytes)
    uint8_t prefixes[4];
    uint8_t prefix_count;

    // Opcode (1-3 bytes)
    uint8_t opcode[3];
    uint8_t opcode_length;

    // ModR/M byte (if present)
    uint8_t modrm;

    // SIB byte (if present)
    uint8_t sib;

    // Displacement (0-4 bytes)
    int32_t displacement;

    // Immediate (0-8 bytes)
    int64_t immediate;
};
```

Instruction Decoding Process
```cpp
// Simplified x86 instruction decoder
class InstructionDecoder {
private:
    uint8_t* instruction_stream;
    size_t position;

public:
    x86_instruction decode() {
        x86_instruction instr = {};

        // 1. Decode prefixes
        instr.prefix_count = decode_prefixes(&instr.prefixes[0]);
        position += instr.prefix_count;

        // 2. Decode opcode
        instr.opcode_length = decode_opcode(&instr.opcode[0]);
        position += instr.opcode_length;

        // 3. Decode ModR/M (if needed)
        if (needs_modrm(instr.opcode[0])) {
            instr.modrm = instruction_stream[position++];
        }

        // 4. Decode SIB (if needed)
        if (needs_sib(instr.modrm)) {
            instr.sib = instruction_stream[position++];
        }

        // 5. Decode displacement
        instr.displacement = decode_displacement(instr.modrm);

        // 6. Decode immediate
        instr.immediate = decode_immediate(instr.opcode[0]);

        return instr;
    }

private:
    // Look ahead without advancing position; decode() advances it afterwards
    // (advancing in both places would skip bytes)
    uint8_t decode_prefixes(uint8_t* prefixes) {
        uint8_t count = 0;
        while (count < 4 && is_prefix(instruction_stream[position + count])) {
            prefixes[count] = instruction_stream[position + count];
            count++;
        }
        return count;
    }

    // Same convention: read at position without advancing
    uint8_t decode_opcode(uint8_t* opcode) {
        opcode[0] = instruction_stream[position];
        // Handle two-byte opcodes (0x0F escape byte)
        if (opcode[0] == 0x0F) {
            opcode[1] = instruction_stream[position + 1];
            return 2;
        }
        return 1;
    }
};
```

Common Instruction Patterns
```cpp
// Example: MOV instruction decoding
void decode_mov_instruction(x86_instruction& instr) {
    switch (instr.opcode[0]) {
        case 0x88: // MOV r/m8, r8
            // Move register to memory/register
            break;
        case 0x89: // MOV r/m32, r32
            // Move register to memory/register (32-bit)
            break;
        case 0x8A: // MOV r8, r/m8
            // Move memory/register to register
            break;
        case 0x8B: // MOV r32, r/m32
            // Move memory/register to register (32-bit)
            break;
        case 0xC7: // MOV r/m32, imm32
            // Move immediate to memory/register
            break;
    }
}
```

Pipeline Hazards and Solutions
Data Hazards
Data hazards occur when an instruction depends on the result of a previous instruction that hasn't completed yet.
```asm
; Data hazard example
add rax, rbx    ; Instruction 1: rax = rax + rbx
sub rcx, rax    ; Instruction 2: rcx = rcx - rax (depends on Instruction 1)
```

Solutions:
1. Forwarding (Bypassing)
```cpp
// Pipeline forwarding logic
struct ForwardingUnit {
    bool forward_from_ex_to_ex;
    bool forward_from_mem_to_ex;

    void check_hazards(uint8_t rs_ex, uint8_t rt_ex,
                       uint8_t rd_mem, uint8_t rd_wb,
                       bool reg_write_mem, bool reg_write_wb) {
        // Forward from EX/MEM to EX (register 0 is hard-wired to zero
        // in MIPS-style ISAs, so it is never forwarded)
        forward_from_ex_to_ex = reg_write_mem && rd_mem != 0 &&
                                (rd_mem == rs_ex || rd_mem == rt_ex);

        // Forward from MEM/WB to EX, but only when the newer EX/MEM
        // result is not already being forwarded
        forward_from_mem_to_ex = reg_write_wb && rd_wb != 0 &&
                                 !forward_from_ex_to_ex &&
                                 (rd_wb == rs_ex || rd_wb == rt_ex);
    }
};
```

2. Pipeline Stalling
```cpp
// Pipeline stall logic (if_id and id_ex are the pipeline latch
// registers between stages, defined elsewhere)
struct HazardUnit {
    bool stall_if_id;
    bool stall_id_ex;
    bool flush_if_id;

    void detect_hazards() {
        // Load-use hazard: a load's result is not available until MEM,
        // so a dependent instruction immediately behind it must stall
        if (is_load_instruction(id_ex.opcode) &&
            (id_ex.rt == if_id.rs || id_ex.rt == if_id.rt)) {
            stall_if_id = true;
            stall_id_ex = true;
        }
    }
};
```

Control Hazards
Control hazards occur when the pipeline doesn't yet know the outcome (or target) of a branch instruction, so it cannot be sure which instructions to fetch next.
```asm
; Control hazard example
cmp rax, 0      ; Compare rax with 0
je  target      ; Jump if equal (branch)
add rbx, rcx    ; Might be the wrong path
sub rdx, rsi    ; Might be the wrong path

target:
mov rax, 1      ; Target of branch
```

Solutions:
1. Branch Prediction
```cpp
// Simple branch predictor
class BranchPredictor {
private:
    uint8_t prediction_table[1024] = {};  // 2-bit saturating counters

public:
    bool predict(uint64_t pc) {
        uint32_t index = (pc >> 2) & 0x3FF;        // Use low PC bits as index
        return (prediction_table[index] >> 1) & 1; // Predict taken if counter >= 2
    }

    void update(uint64_t pc, bool taken) {
        uint32_t index = (pc >> 2) & 0x3FF;
        if (taken && prediction_table[index] < 3) {
            prediction_table[index]++;
        } else if (!taken && prediction_table[index] > 0) {
            prediction_table[index]--;
        }
    }
};
```

2. Speculative Execution
```cpp
// Speculative execution with rollback (current_pc and the registers[16]
// array are the architectural state, defined elsewhere)
struct SpeculativeState {
    uint64_t checkpoint_pc;
    uint64_t checkpoint_registers[16];
    bool speculative;

    void checkpoint() {
        checkpoint_pc = current_pc;
        memcpy(checkpoint_registers, registers, sizeof(checkpoint_registers));
        speculative = true;
    }

    void rollback() {
        if (speculative) {
            current_pc = checkpoint_pc;
            memcpy(registers, checkpoint_registers, sizeof(checkpoint_registers));
            speculative = false;
        }
    }
};
```

Performance Analysis
Pipeline Performance Metrics
```cpp
// Pipeline performance calculation
struct PipelineMetrics {
    uint64_t total_instructions;
    uint64_t total_cycles;
    uint64_t stalls;
    uint64_t mispredictions;

    // Cycles per instruction (lower is better)
    double calculate_cpi() {
        return (double)total_cycles / total_instructions;
    }

    // Instructions per cycle (higher is better)
    double calculate_ipc() {
        return (double)total_instructions / total_cycles;
    }

    // Fraction of cycles spent doing useful work
    double calculate_efficiency() {
        return (double)(total_cycles - stalls) / total_cycles;
    }
};
```

Performance Optimization
```cpp
// Code optimization for pipeline efficiency
// (result, a, b, and data are assumed to be global arrays)

// Good: no data dependencies between iterations
void optimized_loop() {
    for (int i = 0; i < 1000; i++) {
        result[i] = a[i] + b[i];  // Independent operations
    }
}

// Bad: data dependency between iterations
void unoptimized_loop() {
    for (int i = 1; i < 1000; i++) {
        result[i] = result[i-1] + data[i];  // Depends on previous iteration
    }
}
```

Key Concepts Summary
Pipeline Stages
- IF: Instruction fetch from memory
- ID: Instruction decode and register read
- EX: Execute arithmetic/logical operations
- MEM: Memory access for load/store
- WB: Write results back to registers
Instruction Parsing
- Variable-length: x86 instructions can be 1-15 bytes
- Complex decoding: Multiple prefixes, opcodes, operands
- ModR/M byte: Specifies addressing modes
- SIB byte: Scaled indexed addressing
Pipeline Hazards
- Data hazards: Dependencies between instructions
- Control hazards: Branch prediction and speculation
- Structural hazards: Resource conflicts
Performance Optimization
- Forwarding: Bypass pipeline stages for data
- Branch prediction: Predict branch outcomes
- Speculation: Execute instructions before knowing if they're needed
- Code optimization: Minimize dependencies and branches
Questions
Q: Which of the following is NOT a stage in the classic 5-stage RISC pipeline?
The classic 5-stage RISC pipeline consists of: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), and Write Back (WB). Cache Access (CA) is not a standard pipeline stage - cache access is typically handled within the IF and MEM stages.
Q: What type of pipeline hazard occurs when an instruction depends on the result of a previous instruction that hasn't completed yet?
A data hazard occurs when an instruction needs a result from a previous instruction that is still in the pipeline. For example, if instruction A writes to register R1 and instruction B reads from R1, instruction B must wait for A to complete the write-back stage before it can read the correct value.
Q: Which technique is used to resolve data hazards by sending results directly from one pipeline stage to another?
Instruction forwarding (also called bypassing) resolves data hazards by sending results directly from the execute or memory stage to dependent instructions, bypassing the write-back stage. This allows dependent instructions to continue without waiting for the result to be written to registers.
Q: What is the purpose of branch prediction in a pipelined processor?
Branch prediction is used to avoid control hazards by guessing whether a branch will be taken or not taken. This allows the pipeline to continue executing instructions speculatively instead of stalling until the branch outcome is known. Correct predictions improve performance, while incorrect predictions require flushing the pipeline.
Q: Which of the following code patterns is most pipeline-friendly?
A loop with independent operations in each iteration is most pipeline-friendly because it allows the pipeline to execute multiple iterations simultaneously without data hazards. Operations that don't depend on previous results can be executed in parallel, maximizing pipeline utilization and throughput.
Q: What happens when a branch prediction is incorrect in a pipelined processor?
When a branch prediction is incorrect, the pipeline must be flushed (all instructions in the pipeline are discarded) and execution must restart from the correct branch target. This causes a performance penalty as the pipeline must refill with the correct instructions.
Q: Which pipeline stage is responsible for calculating memory addresses for load/store instructions?
The Execute (EX) stage is responsible for calculating memory addresses for load/store instructions. The address calculation typically involves adding a base register value and an offset, which is performed by the ALU in the execute stage. The Memory Access (MEM) stage then uses this calculated address to access memory.
Q: What is the main benefit of pipelining in processors?
The main benefit of pipelining is increased instruction throughput. By allowing multiple instructions to be in different stages of execution simultaneously, the pipeline can complete more instructions per clock cycle, improving overall performance even though individual instructions still take the same number of cycles to complete.