NUMA Architecture

Video: Non-Uniform Memory Architecture (NUMA): A Nearly Unfathomable Morass of Arcana - Fedor Pikus, CppNow

NUMA stands for Non-Uniform Memory Access. It's a computer memory design where memory access times depend on the memory location relative to the processor. In NUMA systems, memory that is local to a processor can be accessed much faster than memory that is remote (belonging to a different processor).

The NUMA Problem

Traditional SMP vs NUMA

In traditional Symmetric Multiprocessing (SMP) systems, all processors share a single memory bus, and all memory accesses have the same latency regardless of which processor makes the access.

In NUMA systems, each processor has its own local memory, and accessing remote memory requires going through an interconnect, which adds significant latency.

Why NUMA Exists

NUMA architectures were developed to solve the memory bandwidth bottleneck in large multi-socket systems. As the number of processors increased, the shared memory bus became a performance bottleneck. NUMA provides:

  • Higher memory bandwidth: Each node has its own memory controller
  • Lower latency: Local memory access is faster
  • Better scalability: Performance scales with the number of nodes

NUMA Node Structure

What is a NUMA Node?

A NUMA node is a group of processors (CPUs) and their local memory that can be accessed with uniform latency. Each NUMA node contains:

  • One or more CPU sockets
  • Associated CPU cores
  • Local memory directly connected to those sockets
  • Memory controller for the local memory

Typical NUMA Configuration

text
NUMA Node 0                    NUMA Node 1
┌─────────────────┐           ┌─────────────────┐
│   CPU Socket 0  │           │   CPU Socket 1  │
│   ┌─────────┐   │           │   ┌─────────┐   │
│   │ Core 0  │   │           │   │ Core 4  │   │
│   │ Core 1  │   │           │   │ Core 5  │   │
│   │ Core 2  │   │           │   │ Core 6  │   │
│   │ Core 3  │   │           │   │ Core 7  │   │
│   └─────────┘   │           │   └─────────┘   │
│                 │           │                 │
│   Local Memory  │◄─────────►│   Local Memory  │
│   (32 GB)       │           │   (32 GB)       │
└─────────────────┘           └─────────────────┘

Local Memory Access

  • Source: Processor accessing its own local memory
  • Latency: ~100-200 nanoseconds
  • Bandwidth: Full memory controller bandwidth
  • Path: Direct connection through local memory controller

Remote Memory Access

  • Source: Processor accessing memory from another node
  • Latency: ~200-400 nanoseconds (2-3x slower)
  • Bandwidth: Limited by interconnect bandwidth
  • Path: Through interconnect to remote memory controller

NUMA Performance Characteristics

Latency Comparison

Access Type       Latency            Relative Cost
L1 Cache Hit      ~1-2 cycles        1x
L2 Cache Hit      ~10-20 cycles      5-10x
L3 Cache Hit      ~40-80 cycles      20-40x
Local Memory      ~100-200 cycles    50-100x
Remote Memory     ~200-400 cycles    100-200x

Bandwidth Characteristics

  • Local memory bandwidth: Full memory controller bandwidth (e.g., 50-100 GB/s)
  • Remote memory bandwidth: Limited by interconnect (e.g., 20-40 GB/s)
  • Interconnect bandwidth: Shared among all nodes

Device Placement and Interconnect Overload

Devices such as network cards, storage controllers, and GPUs are often mounted on specific NUMA nodes. When a device is mounted on a different NUMA node than the CPU processing the data, all device I/O operations must traverse the interconnect, potentially causing:

  • Interconnect congestion: High-bandwidth devices can saturate the cross-node link
  • Increased latency: Device access requires remote memory operations
  • Reduced bandwidth: Shared interconnect bandwidth limits overall performance
cpp
// Example: network card attached to NUMA node 1, processing on NUMA node 0.
// Every buffer must cross the interconnect before it can be processed.
void process_network_data(size_t buffer_size) {
    // Network buffer allocated on the device's NUMA node (node 1)
    void* network_buffer = numa_alloc_onnode(buffer_size, 1);

    // Processing happens on a CPU in node 0, so the data is
    // transferred across the interconnect on every access
    process_data_on_node(network_buffer, buffer_size, 0);

    numa_free(network_buffer, buffer_size);
}

L3 Cache Cross-Connect and Shared Caches

In many modern multi-socket systems, the last-level cache (L3/LLC) is physically distributed across the NUMA nodes but remains accessible from every node through a high-speed interconnect.

L3 Cache Organization

text
NUMA Node 0                    NUMA Node 1
┌─────────────────┐           ┌─────────────────┐
│   CPU Socket 0  │           │   CPU Socket 1  │
│   ┌─────────┐   │           │   ┌─────────┐   │
│   │ Core 0  │   │           │   │ Core 4  │   │
│   │ Core 1  │   │           │   │ Core 5  │   │
│   │ Core 2  │   │           │   │ Core 6  │   │
│   │ Core 3  │   │           │   │ Core 7  │   │
│   └─────────┘   │           │   └─────────┘   │
│                 │           │                 │
│   L3 Cache      │◄─────────►│   L3 Cache      │
│   (Slice 0)     │           │   (Slice 1)     │
└─────────────────┘           └─────────────────┘

Cross-Connect Performance Implications

The L3 cache cross-connect can become a performance bottleneck when:

  1. Cache Coherency Traffic: Multiple nodes accessing the same cache lines
  2. Cache Misses: Remote L3 cache accesses require interconnect traversal
  3. Bandwidth Saturation: High memory bandwidth applications overwhelm the cross-connect
cpp
// Example: false sharing across NUMA nodes
struct SharedData {
    int data[2];  // both elements fit in a single cache line
};

// Thread on node 0 writes to data[0],
// thread on node 1 writes to data[1].
// The writes are logically independent, but because both land in the
// same cache line, every write invalidates the other node's copy and
// forces coherency traffic across the interconnect.

Cache Coherency Protocol Overhead

The cache coherency protocol (MESI/MOESI) generates significant traffic across the interconnect:

  • Invalidation messages: When one node modifies a cache line
  • Snoop requests: Checking if other nodes have copies of cache lines
  • Data transfers: Moving cache lines between nodes
cpp
// Example: cache line ping-pong between nodes
void cache_line_ping_pong() {
    volatile int shared_counter = 0;

    // Threads scheduled on both nodes increment the same counter.
    // Ownership of its cache line bounces between the nodes on every
    // increment, serializing the "parallel" loop. (The unsynchronized
    // increment is also a data race; it is shown only to illustrate
    // the coherency traffic.)
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        shared_counter++;  // each read-modify-write forces a cache line transfer
    }
}

Real-World Impact

NUMA effects are a first-order concern in high-frequency trading and systems programming:

  • Latency sensitivity: Every nanosecond of memory access matters
  • Memory-bound applications: Performance limited by memory access
  • Multi-threaded workloads: Threads accessing remote memory
  • Device I/O: Network and storage devices on different NUMA nodes
  • Cache contention: Shared cache lines causing interconnect traffic

NUMA-Aware Software Design

The Goal

The main goal of NUMA-aware software design is to minimize remote memory accesses and maximize local memory accesses. This involves:

  1. Memory placement: Allocating memory on the correct NUMA node
  2. CPU affinity: Binding threads to specific CPU cores
  3. Data distribution: Organizing data to minimize cross-node access

First-Touch Policy

Most operating systems use a first-touch policy for memory allocation:

  • Memory is allocated on the NUMA node where it's first accessed
  • The thread that first touches a memory page determines its location

Explicit Memory Placement

cpp
// Linux: Using numa_alloc_onnode
void* local_memory = numa_alloc_onnode(size, node_id);

// Windows: Using VirtualAllocExNuma
void* local_memory = VirtualAllocExNuma(
    GetCurrentProcess(), NULL, size,
    MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE, node_id
);

Thread Binding

Binding threads to specific CPU cores ensures they access local memory:

cpp
// Linux: Using pthread_setaffinity_np
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
pthread_setaffinity_np(thread_id, sizeof(cpu_set_t), &cpuset);

// Windows: Using SetThreadAffinityMask
SetThreadAffinityMask(GetCurrentThread(), (1ULL << core_id));

NUMA-Aware Thread Creation

cpp
// Create threads on specific NUMA nodes
for (int node = 0; node < num_nodes; node++) {
    for (int core = 0; core < cores_per_node; core++) {
        int global_core = node * cores_per_node + core;
        create_thread_on_core(global_core);
    }
}

Partitioned Data Structures

cpp
// Each NUMA node owns a portion of the data
struct PartitionedArray {
    int* local_data[NUM_NODES];
    size_t local_size[NUM_NODES];
};

// Allocate each partition on its own node
// (elements_per_node is a placeholder for the per-node share)
for (int node = 0; node < NUM_NODES; node++) {
    array->local_size[node] = elements_per_node * sizeof(int);
    array->local_data[node] = (int*)numa_alloc_onnode(
        array->local_size[node], node
    );
}

NUMA-Aware Allocators

cpp
// Custom allocator that considers NUMA topology
class NumaAllocator {
public:
    void* allocate(size_t size) {
        int current_node = get_current_numa_node();
        return numa_alloc_onnode(size, current_node);
    }

    // numa_free needs the allocation size, so the caller passes it back
    void deallocate(void* ptr, size_t size) {
        numa_free(ptr, size);
    }
};

NUMA Detection and Analysis

Linux Tools

shell
# View NUMA topology
numactl --hardware

# Show NUMA statistics
cat /proc/zoneinfo | grep -E "Node|pages"

# Monitor local vs. remote memory accesses
perf stat -e node-loads,node-load-misses ./your_program

Key Metrics

  • NUMA hits: Local memory accesses
  • NUMA misses: Remote memory accesses
  • Memory bandwidth: Per-node bandwidth utilization
  • Interconnect traffic: Cross-node communication

Tools

  • Linux: perf, numastat, numad
  • Windows: Performance Monitor, Resource Monitor
  • Intel: VTune Profiler
  • AMD: AMD μProf

NUMA Optimization Techniques

NUMA-Aware Allocation

cpp
// Allocate memory on the current NUMA node
void* allocate_local(size_t size) {
    int node = numa_node_of_cpu(sched_getcpu());
    return numa_alloc_onnode(size, node);
}

// Allocate memory on a specific NUMA node
void* allocate_on_node(size_t size, int node) {
    return numa_alloc_onnode(size, node);
}

Memory Migration

cpp
// Move the page containing ptr to a different NUMA node.
// numa_move_pages works per page: ptr must be page-aligned, and a larger
// region needs one array entry per page.
int migrate_memory(void* ptr, int target_node) {
    int status = 0;  // receives the page's resulting node, or a negative errno
    return numa_move_pages(0 /* current process */, 1, &ptr,
                           &target_node, &status, 0);
}

NUMA-Aware Thread Creation

cpp
// Create threads spread across NUMA nodes
// (get_first_core_on_node and bind_thread_to_core are placeholder helpers:
// a topology lookup plus pthread_setaffinity_np, shown earlier)
void create_numa_threads(int num_threads) {
    for (int i = 0; i < num_threads; i++) {
        int node = i % numa_num_configured_nodes();
        int core = get_first_core_on_node(node);

        pthread_t thread;
        pthread_create(&thread, NULL, worker_function, NULL);
        bind_thread_to_core(thread, core);
    }
}

Work Distribution

cpp
// Distribute work based on NUMA topology
void distribute_work_numa(WorkItem* items, int num_items) {
    int num_nodes = numa_num_configured_nodes();
    int items_per_node = num_items / num_nodes;

    for (int node = 0; node < num_nodes; node++) {
        int start = node * items_per_node;
        int end = (node == num_nodes - 1) ? num_items : start + items_per_node;

        // Assign work to threads on this node
        assign_work_to_node(items + start, end - start, node);
    }
}

NUMA-Friendly Data Layout

cpp
// Structure of Arrays (SoA) for NUMA
struct NumaFriendlyData {
    // Each array allocated on a different NUMA node
    float* x[NUM_NODES];
    float* y[NUM_NODES];
    float* z[NUM_NODES];

    // Thread-local storage
    struct ThreadData {
        int node_id;
        int local_start;
        int local_end;
    } threads[MAX_THREADS];
};

Cache-Line Aware Padding

cpp
// Prevent false sharing across NUMA nodes
struct NumaAlignedData {
    alignas(64) int data[NUM_NODES][CACHE_LINE_SIZE / sizeof(int)];
};

Real-World Applications

High-Frequency Trading

In HFT systems, NUMA awareness is critical:

  • Latency sensitivity: Every nanosecond of memory access matters
  • Predictable performance: Consistent memory access patterns
  • Multi-threaded workloads: Market data processing across nodes

Database Systems

Database systems benefit from NUMA optimization:

  • Memory-bound workloads: Large datasets distributed across nodes
  • Concurrent access: Multiple threads accessing different data partitions
  • Buffer pool management: NUMA-aware buffer allocation

Scientific Computing

Scientific applications use NUMA effectively:

  • Large datasets: Distributed across multiple nodes
  • Parallel algorithms: Work distribution based on NUMA topology
  • Memory bandwidth: Maximizing local memory bandwidth

Common NUMA Pitfalls

1. Ignoring NUMA Topology

  • Problem: Not considering NUMA layout when designing software
  • Solution: Profile and optimize for NUMA characteristics

2. Poor Memory Placement

  • Problem: Memory allocated on wrong NUMA nodes
  • Solution: Use NUMA-aware allocators and explicit placement

3. Inefficient Thread Placement

  • Problem: Threads accessing remote memory
  • Solution: Bind threads to appropriate CPU cores

4. False Sharing Across Nodes

  • Problem: Cache lines shared between NUMA nodes
  • Solution: Use proper padding and data layout

Practical Guidelines

When to Optimize for NUMA

  • Multi-socket systems: Systems with multiple CPU sockets
  • Memory-bound applications: Applications limited by memory bandwidth
  • Latency-sensitive workloads: Where memory access time matters
  • Large datasets: Applications processing large amounts of data

When Not to Optimize for NUMA

  • Single-socket systems: No NUMA topology to optimize for
  • CPU-bound applications: Not limited by memory access
  • Small datasets: Overhead exceeds benefits
  • Simple applications: Where optimization complexity isn't justified

Measurement

Always measure NUMA performance before and after optimization:

cpp
#include <chrono>
#include <iostream>
#include <numa.h>
#include <sched.h>

// Measure memory access performance
auto start = std::chrono::high_resolution_clock::now();
// ... memory access code ...
auto end = std::chrono::high_resolution_clock::now();

auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
std::cout << "Memory access time: " << duration.count() << " nanoseconds\n";

// Check which NUMA node the calling thread is currently running on
std::cout << "Running on NUMA node: " << numa_node_of_cpu(sched_getcpu()) << "\n";

Understanding NUMA architecture and implementing NUMA-aware software design is essential for achieving optimal performance on modern multi-socket systems, especially in latency-sensitive applications like high-frequency trading.

Questions

Q: What does NUMA stand for?

NUMA stands for Non-Uniform Memory Access. It describes a computer memory design where memory access times depend on the memory location relative to the processor. Memory that is local to a processor can be accessed faster than memory that is remote.

Q: What is a NUMA node?

A NUMA node is a group of processors (CPUs) and their local memory that can be accessed with uniform latency. Each NUMA node contains one or more CPU sockets, their associated cores, and the memory directly connected to those sockets.

Q: What is the main performance issue with NUMA systems?

The main performance issue with NUMA systems is that memory access latency varies depending on whether the memory is local or remote to the processor. Local memory access is much faster than remote memory access, which can cause significant performance degradation if not managed properly.

Q: What is local memory in a NUMA system?

Local memory in a NUMA system is memory that is directly connected to a specific processor socket. This memory can be accessed with the lowest latency by the processors in that NUMA node.

Q: What is remote memory in a NUMA system?

Remote memory in a NUMA system is memory that belongs to a different NUMA node. Accessing remote memory requires going through the interconnect between nodes, which adds significant latency compared to local memory access.

Q: What is CPU affinity in NUMA-aware programming?

CPU affinity in NUMA-aware programming is binding threads to specific CPU cores to control memory access patterns. This ensures that threads access memory from their local NUMA node, reducing latency and improving performance.

Q: What is memory placement in NUMA systems?

Memory placement in NUMA systems is allocating memory from specific NUMA nodes to optimize access patterns. This involves ensuring that data is allocated on the same NUMA node as the threads that will access it most frequently.

Q: What is the typical latency difference between local and remote memory access?

Remote memory access is typically 2-3x slower than local memory access in NUMA systems. This significant latency difference makes NUMA-aware programming crucial for high-performance applications.

Q: What is a NUMA-aware allocator?

A NUMA-aware allocator is an allocator that considers NUMA topology when allocating memory. It tries to allocate memory from the NUMA node that is local to the thread making the allocation request, reducing memory access latency.

Q: What is the main goal of NUMA-aware software design?

The main goal of NUMA-aware software design is to minimize remote memory accesses and maximize local memory accesses. This reduces memory access latency and improves overall application performance.