CPU Registers and Assembly Instructions

Video: Assembly Language in 100 Seconds

CPU registers are small, fast storage locations built directly into the CPU. Think of them as the CPU's "working memory": they hold the data the CPU is currently processing. Unlike main memory (RAM), registers can typically be accessed within a single CPU cycle. A CPU cycle is one tick of the processor clock, the basic unit of time in which the CPU does its work. Modern CPUs run at clock speeds measured in GHz, meaning billions of cycles per second. For example, a 3.0 GHz CPU completes 3 billion cycles per second, so each cycle takes about 0.33 nanoseconds.
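
The relationship between clock speed and cycle time is simple arithmetic, sketched below. The helper name `cycle_time_ns` is made up for this illustration.

```cpp
// Illustrative helper (hypothetical name): converts a clock frequency
// in GHz to the duration of one cycle in nanoseconds.
constexpr double cycle_time_ns(double frequency_ghz) {
    // 1 GHz = 1e9 cycles per second and 1 second = 1e9 ns,
    // so one cycle lasts 1 / frequency_ghz nanoseconds.
    return 1.0 / frequency_ghz;
}
```

Calling `cycle_time_ns(3.0)` gives roughly a third of a nanosecond per cycle.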

x86_64 Register Architecture

Register Characteristics

Size: 64-bit on modern x86_64 processors (x86_64 is the 64-bit architecture used in modern Intel and AMD processors)
Count: Limited number (16 general-purpose + special-purpose)
Speed: 1 CPU cycle access (fastest storage)
Purpose: Hold operands, addresses, and results

General-Purpose Registers

The x86_64 architecture provides 16 general-purpose registers, each 64 bits wide:

cpp

; 64-bit general-purpose registers
RAX, RBX, RCX, RDX    ; Primary data registers
RSI, RDI, RBP, RSP    ; Pointer/index registers  
R8, R9, R10, R11      ; Additional data registers
R12, R13, R14, R15    ; Additional data registers

These registers can be accessed in different sizes:

64-bit: RAX, RBX, RCX, etc.
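32-bit: EAX, EBX, ECX, etc. (lower 32 bits; writing one of these zeroes the upper 32 bits of the full register)
16-bit: AX, BX, CX, etc. (lower 16 bits)
8-bit: AL, BL, CL, DL, etc. (lowest 8 bits); AH, BH, CH, DH (bits 8-15, available only for RAX, RBX, RCX, and RDX)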
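
One way to picture how the narrower names alias the same register is to treat the 64-bit value as an integer and slice it with casts and shifts. This is a minimal sketch; the helper names `low32`, `low16`, `low8`, and `high8` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helpers that mimic how the narrower register names
// alias the low bits of a 64-bit register such as RAX.
constexpr std::uint32_t low32(std::uint64_t r) { return static_cast<std::uint32_t>(r); }      // like EAX
constexpr std::uint16_t low16(std::uint64_t r) { return static_cast<std::uint16_t>(r); }      // like AX
constexpr std::uint8_t  low8(std::uint64_t r)  { return static_cast<std::uint8_t>(r); }       // like AL
constexpr std::uint8_t  high8(std::uint64_t r) { return static_cast<std::uint8_t>(r >> 8); }  // like AH (bits 8-15)
```

For a register holding 0x1122334455667788, these return 0x55667788, 0x7788, 0x88, and 0x77 respectively.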

Instruction Register

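The Instruction Pointer (RIP), also called the Program Counter (PC), is the register that contains the address of the next instruction to execute. When running a program, this register tells the processor which machine instruction to fetch next. Strictly speaking, the term "instruction register" refers to an internal register that holds the instruction currently being decoded; the register that programs can observe on x86_64 is RIP.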
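
To see the role this register plays, the following toy interpreter keeps a `pc` index that always names the next instruction to run. The instruction set here is invented for the sketch and is not x86_64; all names (`Op`, `Instr`, `run`) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy instruction set, invented for this sketch (not x86_64).
enum class Op { AddImm, JumpIfLess, Halt };

struct Instr {
    Op op;
    std::int64_t operand;  // immediate for AddImm, limit for JumpIfLess
    std::size_t target;    // jump target for JumpIfLess (unused otherwise)
};

// Runs the program and returns the final accumulator value.
std::int64_t run(const std::vector<Instr>& program) {
    std::int64_t acc = 0;
    std::size_t pc = 0;                 // plays the role of RIP
    while (pc < program.size()) {
        const Instr& in = program[pc];  // fetch the instruction at pc
        ++pc;                           // pc now names the next instruction
        switch (in.op) {
            case Op::AddImm:
                acc += in.operand;
                break;
            case Op::JumpIfLess:
                if (acc < in.operand) pc = in.target;  // a taken jump overwrites pc
                break;
            case Op::Halt:
                return acc;
        }
    }
    return acc;
}
```

Sequential execution is just `++pc`; a jump is nothing more than a write to `pc`.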

Assembly Instruction Basics

Assembly is a low-level programming language that corresponds directly to machine code instructions. Assembly instructions tell the CPU exactly what operations to perform. Different instructions have different costs: some are faster than others. Understanding these costs helps you write high-level code that compiles to efficient assembly.

Instruction Format

cpp

; Basic instruction format
instruction destination, source

; Examples
mov rax, 42          ; Move immediate value 42 to RAX
add rax, rbx         ; Add RBX to RAX, store result in RAX
sub rcx, 10          ; Subtract 10 from RCX

Addressing Modes

Addressing modes specify how the CPU should access data:

cpp

; Immediate addressing (constant value)
mov rax, 42          ; rax = 42

; Register addressing
mov rax, rbx         ; rax = rbx

; Direct memory addressing
mov rax, [0x1000]    ; rax = value at memory address 0x1000

; Register indirect addressing
mov rax, [rbx]       ; rax = value at memory address in RBX

; Base + offset addressing
mov rax, [rbx + 8]   ; rax = value at memory address RBX + 8

; Indexed addressing
mov rax, [rbx + rcx] ; rax = value at memory address RBX + RCX

; Scaled indexed addressing
mov rax, [rbx + rcx*4] ; rax = value at memory address RBX + RCX*4

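Base + offset addressing is commonly used for accessing structure members and array elements at a fixed index. Scaled indexed addressing handles array elements whose index is held in a register.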
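
The following sketch shows C++ accesses that typically map onto these addressing modes. The `Quote` type and both function names are hypothetical, and the instructions in the comments are what a compiler might plausibly emit, not guaranteed output.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical record type used only for this illustration.
struct Quote {
    std::int64_t price;     // offset 0
    std::int64_t quantity;  // offset 8
};

// Base + offset: the member sits at a fixed distance from the object's
// address, so it can be read with something like mov rax, [rdi + 8].
std::int64_t read_quantity(const Quote* q) {
    return q->quantity;
}

// Scaled indexed: element i of an int32_t array lives at base + i * 4,
// which matches the [base + index*4] form.
std::int32_t read_element(const std::int32_t* base, std::size_t i) {
    return base[i];
}
```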

Common Assembly Instructions

Data Movement Instructions

cpp

; MOV - Move data
mov rax, 42          ; Load immediate value
mov rax, rbx         ; Copy register to register
mov rax, [rbx]       ; Load from memory
mov [rax], rbx       ; Store to memory

; LEA - Load effective address (compute address without accessing memory)
lea rax, [rbx + rcx*4] ; rax = rbx + rcx*4 (no memory access)

; PUSH/POP - Stack operations
push rax             ; Push RAX onto stack (RSP -= 8, [RSP] = RAX)
pop rax              ; Pop from stack to RAX (RAX = [RSP], RSP += 8)

; MOVZX/MOVSX - Zero/sign extend
movzx rax, byte [rbx] ; Zero-extend byte to 64-bit
movsx rax, byte [rbx] ; Sign-extend byte to 64-bit

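The LEA instruction computes an address expression (base + index*scale + offset) and writes the resulting address to a register without accessing memory. This lets it perform an addition and a small multiplication in one instruction, leave the flags untouched, and place the result in a different register than its inputs, which often saves instructions compared with separate shift and add operations.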
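
Compilers often use LEA for plain integer arithmetic that happens to fit the address-expression shape. The two functions below are a sketch of such candidates; the function names are hypothetical, and whether a given compiler actually emits `lea` depends on the compiler and optimization level.

```cpp
#include <cstdint>

// x * 5 can be written as x + x*4, which fits the base + index*scale form.
// An optimizing compiler may emit something like: lea rax, [rdi + rdi*4]
std::int64_t times_five(std::int64_t x) {
    return x * 5;
}

// base + index*4 + 8 also fits a single address expression.
// Unsigned types keep the arithmetic well defined on wraparound.
std::uint64_t scaled_sum(std::uint64_t base, std::uint64_t index) {
    return base + index * 4 + 8;
}
```

Pasting these into Compiler Explorer with optimization enabled is a quick way to check what your compiler chooses.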

Arithmetic Instructions

cpp

; ADD - Addition
add rax, rbx         ; rax = rax + rbx
add rax, 42          ; rax = rax + 42

; SUB - Subtraction
sub rax, rbx         ; rax = rax - rbx
sub rax, 10          ; rax = rax - 10

; MUL - Unsigned multiplication
mul rbx              ; RDX:RAX = RAX * RBX (128-bit result, high half in RDX)

; IMUL - Signed multiplication
imul rax, rbx        ; rax = rax * rbx
imul rax, rbx, 42    ; rax = rbx * 42

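; DIV - Unsigned division (dividend is the 128-bit value RDX:RAX)
div rbx              ; RAX = RDX:RAX / RBX, RDX = remainder (zero RDX first for a 64-bit dividend)

; IDIV - Signed division (dividend is the 128-bit value RDX:RAX)
idiv rbx             ; RAX = RDX:RAX / RBX, RDX = remainder (sign-extend RAX into RDX with cqo first)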

; INC/DEC - Increment/Decrement
inc rax              ; rax = rax + 1
dec rax              ; rax = rax - 1

; NEG - Negate
neg rax              ; rax = -rax
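
The one-operand MUL form produces a result twice as wide as its inputs, split across RDX (high half) and RAX (low half). The sketch below reproduces that split in portable C++ by multiplying 32-bit pieces; `U128` and `mul_u64` are hypothetical names introduced for illustration.

```cpp
#include <cstdint>

// Hypothetical pair type standing in for the RDX:RAX register pair.
struct U128 {
    std::uint64_t hi;  // what MUL would leave in RDX
    std::uint64_t lo;  // what MUL would leave in RAX
};

// Full 64 x 64 -> 128-bit unsigned multiply built from 32-bit halves.
U128 mul_u64(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t mask = 0xFFFFFFFFull;
    const std::uint64_t a_lo = a & mask, a_hi = a >> 32;
    const std::uint64_t b_lo = b & mask, b_hi = b >> 32;

    const std::uint64_t p0 = a_lo * b_lo;  // contributes at bit 0
    const std::uint64_t p1 = a_lo * b_hi;  // contributes at bit 32
    const std::uint64_t p2 = a_hi * b_lo;  // contributes at bit 32
    const std::uint64_t p3 = a_hi * b_hi;  // contributes at bit 64

    // Sum everything that lands in bits 32..63, keeping the carry.
    const std::uint64_t mid = (p0 >> 32) + (p1 & mask) + (p2 & mask);

    U128 r;
    r.lo = (p0 & mask) | (mid << 32);
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}
```

Multiplying 0xFFFFFFFFFFFFFFFF by 2 overflows 64 bits, so the high half becomes 1 and the low half becomes 0xFFFFFFFFFFFFFFFE.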

Logical Instructions

cpp

; AND - Bitwise AND
and rax, rbx         ; rax = rax & rbx
and rax, 0xFF        ; Clear upper bits, keep only lowest byte

; OR - Bitwise OR
or rax, rbx          ; rax = rax | rbx
or rax, 0x80         ; Set highest bit of lowest byte

; XOR - Bitwise XOR
xor rax, rbx         ; rax = rax ^ rbx
xor rax, rax         ; rax = 0 (fast way to zero register)

; NOT - Bitwise NOT
not rax              ; rax = ~rax

; SHL/SHR - Logical shift left/right
shl rax, 3           ; rax = rax << 3 (multiply by 8)
shr rax, 2           ; rax = rax >> 2 (unsigned divide by 4)

; SAL/SAR - Arithmetic shift left/right
sal rax, 1           ; rax = rax << 1 (same as SHL)
sar rax, 1           ; rax = rax >> 1 (preserves sign bit; signed divide by 2, rounding down)

xor rax, rax (XOR, or exclusive OR, is a logical operation that returns 1 if exactly one input is 1, otherwise 0) is the idiomatic way to zero a register: it is a single instruction that needs no immediate value, so its encoding is shorter than mov rax, 0, and modern x86 CPUs recognize it as a zeroing idiom.
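
The same bitwise operations are available directly in C++. The sketch below pairs each expression with the instruction it resembles; the function names are hypothetical, and unsigned types are used so that the shifts behave like SHL and SHR.

```cpp
#include <cstdint>

constexpr std::uint64_t low_byte(std::uint64_t x)     { return x & 0xFF; }      // and rax, 0xFF
constexpr std::uint64_t set_bit7(std::uint64_t x)     { return x | 0x80; }      // or rax, 0x80
constexpr bool          is_odd(std::uint64_t x)       { return (x & 1) != 0; }  // test rax, 0x01
constexpr std::uint64_t times_eight(std::uint64_t x)  { return x << 3; }        // shl rax, 3
constexpr std::uint64_t quarter(std::uint64_t x)      { return x >> 2; }        // shr rax, 2
constexpr std::uint64_t zero_via_xor(std::uint64_t x) { return x ^ x; }         // xor rax, rax
```

Any value XORed with itself is zero, which is why `zero_via_xor` returns 0 regardless of its input.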

Comparison and Control Flow

cpp

; CMP - Compare (sets flags, doesn't store result)
cmp rax, rbx         ; Compare RAX with RBX, set flags
cmp rax, 0           ; Compare RAX with 0

; TEST - Bitwise AND (sets flags, doesn't store result)
test rax, rax        ; Test if RAX is zero
test rax, 0x01       ; Test if lowest bit is set

; Conditional jumps (based on flags set by CMP/TEST)
je label             ; Jump if equal (ZF = 1)
jne label            ; Jump if not equal (ZF = 0)
jg label             ; Jump if greater (signed)
jl label             ; Jump if less (signed)
ja label             ; Jump if above (unsigned)
jb label             ; Jump if below (unsigned)

; Unconditional jump
jmp label            ; Always jump

; Function calls
call function_name   ; Call function (push return address, jump to function)
ret                  ; Return from function (pop return address into RIP)

The CMP instruction performs a comparison by subtracting the second operand from the first, but doesn't store the result. Instead, it sets status flags that are used by conditional jump instructions.
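
A simplified software model can make the flag logic concrete. The sketch below derives four flags from `a - b` and then evaluates the conditions that `je`, `jl`, and `jb` test. The `Flags` struct and function names are hypothetical, and the model ignores the other flags a real CPU also updates.

```cpp
#include <cstdint>

// Simplified model of four flags that CMP a, b derives from a - b.
struct Flags {
    bool zf;  // zero flag: result is zero
    bool sf;  // sign flag: top bit of the result
    bool cf;  // carry flag: an unsigned borrow occurred
    bool of;  // overflow flag: signed overflow occurred
};

Flags compare(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t result = a - b;  // computed, then discarded by CMP
    Flags f;
    f.zf = (result == 0);
    f.sf = (result >> 63) != 0;
    f.cf = (a < b);
    // Signed overflow: the operands had different signs and the
    // result's sign differs from the sign of a.
    f.of = (((a ^ b) & (a ^ result)) >> 63) != 0;
    return f;
}

bool would_take_je(const Flags& f) { return f.zf; }          // equal
bool would_take_jl(const Flags& f) { return f.sf != f.of; }  // signed less
bool would_take_jb(const Flags& f) { return f.cf; }          // unsigned below
```

The same bit pattern can be "less" in the signed sense and "above" in the unsigned sense: comparing 0xFFFFFFFFFFFFFFFF with 1 takes `jl` (it is -1 when signed) but not `jb`.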

Practical Examples

Simple Function in Assembly

cpp

; C function: int add(int a, int b) { return a + b; }
section .text
global add

add:
    ; Function prologue
    push rbp          ; Save old base pointer
    mov rbp, rsp      ; Set new base pointer

    ; Function body (int is 32 bits, so use the 32-bit register names)
    mov eax, edi      ; First argument (a) is in EDI
    add eax, esi      ; Second argument (b) is in ESI
                     ; Result is already in EAX (return value)

    ; Function epilogue
    pop rbp           ; Restore old base pointer
    ret               ; Return

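In the System V AMD64 calling convention used on Linux and macOS, the first six integer arguments are passed in registers: RDI, RSI, RDX, RCX, R8, R9. Return values are placed in RAX. The Microsoft x64 convention used on Windows differs: it passes the first four integer arguments in RCX, RDX, R8, R9.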
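
As a quick reference, the sketch below annotates a six-argument function with the register each parameter is expected to arrive in under the System V convention. The function name `sum6` is hypothetical, and the annotations assume a Linux or macOS x86_64 target.

```cpp
#include <cstdint>

// Integer arguments are assigned to registers in declaration order;
// the comments show the expected register under System V AMD64.
std::int64_t sum6(std::int64_t a,    // RDI
                  std::int64_t b,    // RSI
                  std::int64_t c,    // RDX
                  std::int64_t d,    // RCX
                  std::int64_t e,    // R8
                  std::int64_t f) {  // R9
    return a + b + c + d + e + f;    // result is returned in RAX
}
```

A seventh integer argument would no longer fit in a register and would be passed on the stack.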

Loop Example

cpp

; C equivalent: for(int i = 0; i < 10; i++) { sum += i; }
section .text
global sum_loop

sum_loop:
    mov rax, 0        ; sum = 0
    mov rcx, 0        ; i = 0

loop_start:
    cmp rcx, 10       ; Compare i with 10
    jge loop_end      ; If i >= 10, exit loop

    add rax, rcx      ; sum += i
    inc rcx           ; i++
    jmp loop_start    ; Jump back to loop start

loop_end:
    ret               ; Return sum (in RAX)
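
To make the correspondence explicit, the same loop can be written in C++ with labels and `goto`, so that each statement lines up with one assembly instruction. This is purely illustrative (the name `sum_loop_goto` is hypothetical) and not a recommended coding style.

```cpp
// The loop rewritten with explicit labels and jumps, mirroring
// the structure of the assembly version.
long sum_loop_goto() {
    long sum = 0;                 // mov rax, 0
    long i = 0;                   // mov rcx, 0
loop_start:
    if (i >= 10) goto loop_end;   // cmp rcx, 10 / jge loop_end
    sum += i;                     // add rax, rcx
    ++i;                          // inc rcx
    goto loop_start;              // jmp loop_start
loop_end:
    return sum;                   // ret (sum is in RAX)
}
```

The function returns 45, the sum of the integers 0 through 9.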

Array Access

cpp

; C equivalent: int sum = arr[0] + arr[1] + arr[2];
section .text
global sum_array

sum_array:
    mov eax, [rdi]        ; Load arr[0] (32-bit load, since int is 4 bytes)
    add eax, [rdi + 4]    ; Add arr[1] (second element, 4 bytes offset)
    add eax, [rdi + 8]    ; Add arr[2] (third element, 8 bytes offset)
    ret
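
The byte offsets in this example come from the element size. The sketch below spells that out in C++ by reading 4-byte integers at explicit byte offsets, assuming the array holds 32-bit integers; `load_i32` and `sum_first_three` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reads one 32-bit integer located byte_offset bytes past base.
std::int32_t load_i32(const void* base, std::size_t byte_offset) {
    std::int32_t value;
    std::memcpy(&value,
                static_cast<const unsigned char*>(base) + byte_offset,
                sizeof value);
    return value;
}

// Equivalent to arr[0] + arr[1] + arr[2], with the offsets written out.
std::int32_t sum_first_three(const std::int32_t* arr) {
    return load_i32(arr, 0) + load_i32(arr, 4) + load_i32(arr, 8);
}
```

Each load touches exactly 4 bytes, so reading the third element never reaches past the end of a three-element array.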

Conditional Logic

cpp

; C equivalent: return (a > b) ? a : b;   (a and b assumed to be 64-bit signed integers)
section .text
global max

max:
    cmp rdi, rsi      ; Compare a (RDI) with b (RSI)
    cmovg rax, rdi    ; If a > b, RAX = RDI
    cmovle rax, rsi   ; If a <= b, RAX = RSI
    ret
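
In C++ the same logic is a plain ternary expression. With optimization enabled, GCC and Clang commonly compile a function like the one below to a compare plus a conditional move rather than a branch, though that is an observation about typical output and not a guarantee. The name `max_i64` is hypothetical.

```cpp
#include <cstdint>

// Returns the larger of two signed 64-bit values.
std::int64_t max_i64(std::int64_t a, std::int64_t b) {
    return (a > b) ? a : b;
}
```

Conditional moves avoid branch mispredictions when the comparison outcome is hard to predict.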

Performance Considerations

Register Usage Optimization

cpp

; Good: Use registers efficiently
mov rax, 42          ; Load constant once
mov rbx, rax         ; Copy to another register
add rcx, rax         ; Use same value multiple times

; Bad: Repeated memory access
add rcx, [memory_location]  ; Load from memory each time
add rdx, [memory_location]  ; Load from memory each time
add r8, [memory_location]   ; Load from memory each time
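
The same idea shows up in C++ when a value is read through a pointer. In the sketch below (function names are hypothetical), the first version may force the compiler to reload `*value` after every store because the outputs could alias it, while the second copies it into a local that can live in a register.

```cpp
#include <cstdint>

// Re-reads *value for every addition. Because a, b, or c might point
// at the same object as value, the compiler generally has to reload it.
void add_reloading(std::int64_t* a, std::int64_t* b, std::int64_t* c,
                   const std::int64_t* value) {
    *a += *value;
    *b += *value;
    *c += *value;
}

// Reads *value once into a local, which can be kept in a register.
void add_cached(std::int64_t* a, std::int64_t* b, std::int64_t* c,
                const std::int64_t* value) {
    const std::int64_t v = *value;
    *a += v;
    *b += v;
    *c += v;
}
```

The two versions behave identically only when none of the output pointers alias `value`, so the cached form is a deliberate choice and not a mechanical rewrite.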

Instruction Selection

cpp

; Preferred: Use compact idioms
xor rax, rax         ; Zero register (shorter encoding than mov rax, 0)
lea rax, [rbx + rcx*4] ; Address calculation (no memory access)
test rax, rax        ; Test for zero (shorter encoding than cmp rax, 0)

; Less compact: Same effect, larger encoding
mov rax, 0           ; Needs an immediate; longer than xor rax, rax
cmp rax, 0           ; Needs an immediate; longer than test rax, rax

Exploring Assembly with Godbolt.org

Godbolt Compiler Explorer is an invaluable tool for understanding how your C++ code translates to assembly. It allows you to see the exact assembly instructions generated by different compilers and optimization levels.

Getting Started with Godbolt

Visit https://godbolt.org/
Choose Compiler: Select GCC, Clang, or MSVC
Set Optimization: Try different -O levels
Write C++ Code: Enter your C++ code in the left panel
View Assembly: See the generated assembly in the right panel

What to observe:

How function arguments are passed (RDI, RSI, etc.)
How return values are handled (RAX)
How different operations translate to assembly
How optimization affects the output

Example: Understanding C++ to Assembly Translation

Try this simple C++ function in Godbolt:

cpp

int add(int a, int b) {
    return a + b;
}

You'll see it compiles to something like:

cpp

add(int, int):
    lea     eax, [rdi+rsi]
    ret

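With optimization enabled (for example -O2), the compiler folds the addition into a single lea instruction, which does the work of a separate mov and add. Without optimization, you will typically see the arguments copied to the stack first.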

Why This Matters for Quantitative Programming

Understanding assembly helps you write efficient code for:

High-frequency trading systems where every nanosecond counts
Risk calculations that process large datasets
Market data processing where throughput is critical
Algorithm optimization where you need to understand the cost of operations

When you understand how your C++ code translates to assembly, you can make informed decisions about:

Which data structures to use
How to structure loops and conditionals
When to use inline functions
How to minimize cache misses
Which compiler optimizations to enable

This knowledge becomes especially important when you move to advanced topics like lock-free programming, memory ordering, and low-latency systems.

Questions

Q: Which instruction is the fastest way to zero a register in x86_64?

XOR RAX, RAX is the fastest way to zero a register because it's a single instruction that doesn't require an immediate value. MOV RAX, 0 requires encoding the immediate value 0, making it slightly larger and potentially slower.

Q: What does the CMP instruction do?

CMP performs a comparison by subtracting the second operand from the first, but doesn't store the result. Instead, it sets the status flags (ZF, SF, CF, OF) based on the result, which are then used by conditional jump instructions.

Q: Which addressing mode is used in the instruction 'MOV RAX, [RBX + 8]'?

The instruction 'MOV RAX, [RBX + 8]' uses base + offset addressing, where RBX is the base register and 8 is the offset. This is commonly used for accessing array elements or structure members.

Q: Which instruction is used to call a function in x86_64 assembly?

The CALL instruction is used to call a function. It pushes the return address (the address of the next instruction) onto the stack and then jumps to the function's address. The RET instruction is used to return from a function.

Q: What is the purpose of the LEA instruction?

LEA (Load Effective Address) computes an address without actually accessing memory. It's commonly used for address calculations, and it can replace a short sequence of shift and add instructions with a single instruction that leaves the flags untouched.

Q: In the x86_64 calling convention, where is the first function argument typically passed?

In the x86_64 System V ABI calling convention, the first six integer arguments are passed in registers: RDI, RSI, RDX, RCX, R8, R9. RDI contains the first argument.
