NUMA Architecture

Video: Non-Uniform Memory Architecture (NUMA): A Nearly Unfathomable Morass of Arcana - Fedor Pikus, CppNow

NUMA stands for Non-Uniform Memory Access. It's a computer memory design where memory access times depend on the memory location relative to the processor. In NUMA systems, memory that is local to a processor can be accessed much faster than memory that is remote (belonging to a different processor).

The NUMA Problem

Traditional SMP vs NUMA

In traditional Symmetric Multiprocessing (SMP) systems, all processors share a single memory bus, and all memory accesses have the same latency regardless of which processor makes the access.

In NUMA systems, each processor has its own local memory, and accessing remote memory requires going through an interconnect, which adds significant latency.

Why NUMA Exists

NUMA architectures were developed to solve the memory bandwidth bottleneck in large multi-socket systems. As the number of processors increased, the shared memory bus became a performance bottleneck. NUMA provides:

  • Higher memory bandwidth: Each node has its own memory controller
  • Lower latency: Local memory access is faster
  • Better scalability: Performance scales with the number of nodes

NUMA Node Structure

What is a NUMA Node?

A NUMA node is a group of processors (CPUs) and their local memory that can be accessed with uniform latency. Each NUMA node contains:

  • One or more CPU sockets
  • Associated CPU cores
  • Local memory directly connected to those sockets
  • Memory controller for the local memory

Typical NUMA Configuration

text
NUMA Node 0                    NUMA Node 1
┌─────────────────┐           ┌─────────────────┐
│   CPU Socket 0  │           │   CPU Socket 1  │
│   ┌─────────┐   │           │   ┌─────────┐   │
│   │ Core 0  │   │           │   │ Core 4  │   │
│   │ Core 1  │   │           │   │ Core 5  │   │
│   │ Core 2  │   │           │   │ Core 6  │   │
│   │ Core 3  │   │           │   │ Core 7  │   │
│   └─────────┘   │           │   └─────────┘   │
│                 │           │                 │
│   Local Memory  │◄─────────►│   Local Memory  │
│   (32 GB)       │           │   (32 GB)       │
└─────────────────┘           └─────────────────┘

Local Memory Access

  • Source: Processor accessing its own local memory
  • Latency: ~100-200 nanoseconds
  • Bandwidth: Full memory controller bandwidth
  • Path: Direct connection through local memory controller

Remote Memory Access

  • Source: Processor accessing memory from another node
  • Latency: ~200-400 nanoseconds (2-3x slower)
  • Bandwidth: Limited by interconnect bandwidth
  • Path: Through interconnect to remote memory controller

NUMA Performance Characteristics

Latency Comparison

Access Type       Latency            Relative Cost
L1 Cache Hit      ~1-2 cycles        1x
L2 Cache Hit      ~10-20 cycles      5-10x
L3 Cache Hit      ~40-80 cycles      20-40x
Local Memory      ~100-200 cycles    50-100x
Remote Memory     ~200-400 cycles    100-200x

Bandwidth Characteristics

  • Local memory bandwidth: Full memory controller bandwidth (e.g., 50-100 GB/s)
  • Remote memory bandwidth: Limited by interconnect (e.g., 20-40 GB/s)
  • Interconnect bandwidth: Shared among all nodes

Device Placement and Interconnect Overload

Devices such as network cards, storage controllers, and GPUs are often mounted on specific NUMA nodes. When a device is mounted on a different NUMA node than the CPU processing the data, all device I/O operations must traverse the interconnect, potentially causing:

  • Interconnect congestion: High-bandwidth devices can saturate the cross-node link
  • Increased latency: Device access requires remote memory operations
  • Reduced bandwidth: Shared interconnect bandwidth limits overall performance
cpp
// Example: network card attached to NUMA node 1, processing on NUMA node 0.
// Every buffer must cross the interconnect before it can be processed.
void process_network_data(size_t buffer_size) {
    // Network buffer allocated on the device's NUMA node (node 1)
    void* network_buffer = numa_alloc_onnode(buffer_size, 1);

    // Processing happens on a CPU in node 0, so the data is
    // transferred across the interconnect on every access
    process_data_on_node(network_buffer, buffer_size, 0);

    numa_free(network_buffer, buffer_size);
}

L3 Cache Cross-Connect and Shared Caches

In many modern multi-socket systems, the last-level cache (L3/LLC) is physically distributed across the NUMA nodes but remains accessible from every node through a high-speed interconnect.

L3 Cache Organization

text
NUMA Node 0                    NUMA Node 1
┌─────────────────┐           ┌─────────────────┐
│   CPU Socket 0  │           │   CPU Socket 1  │
│   ┌─────────┐   │           │   ┌─────────┐   │
│   │ Core 0  │   │           │   │ Core 4  │   │
│   │ Core 1  │   │           │   │ Core 5  │   │
│   │ Core 2  │   │           │   │ Core 6  │   │
│   │ Core 3  │   │           │   │ Core 7  │   │
│   └─────────┘   │           │   └─────────┘   │
│                 │           │                 │
│   L3 Cache      │◄─────────►│   L3 Cache      │
│   (Slice 0)     │           │   (Slice 1)     │
└─────────────────┘           └─────────────────┘

Cross-Connect Performance Implications

The L3 cache cross-connect can become a performance bottleneck when:

  1. Cache Coherency Traffic: Multiple nodes accessing the same cache lines
  2. Cache Misses: Remote L3 cache accesses require interconnect traversal
  3. Bandwidth Saturation: High memory bandwidth applications overwhelm the cross-connect
cpp
// Example: false sharing across NUMA nodes
struct SharedData {
    int data[2];  // both elements fit in a single cache line
};

// Thread on node 0 writes to data[0],
// thread on node 1 writes to data[1].
// The writes are logically independent, but because both land in the
// same cache line, every write invalidates the other node's copy and
// forces coherency traffic across the interconnect.

Cache Coherency Protocol Overhead

The cache coherency protocol (MESI/MOESI) generates significant traffic across the interconnect:

  • Invalidation messages: When one node modifies a cache line
  • Snoop requests: Checking if other nodes have copies of cache lines
  • Data transfers: Moving cache lines between nodes
cpp
// Example: cache line ping-pong between nodes
void cache_line_ping_pong() {
    volatile int shared_counter = 0;

    // Threads scheduled on both nodes increment the same counter.
    // Ownership of its cache line bounces between the nodes on every
    // increment, serializing the "parallel" loop. (The unsynchronized
    // increment is also a data race; it is shown only to illustrate
    // the coherency traffic.)
    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        shared_counter++;  // each read-modify-write forces a cache line transfer
    }
}

Real-World Impact

NUMA effects are a first-order concern in high-frequency trading and systems programming:

  • Latency sensitivity: Every nanosecond of memory access matters
  • Memory-bound applications: Performance limited by memory access
  • Multi-threaded workloads: Threads accessing remote memory
  • Device I/O: Network and storage devices on different NUMA nodes
  • Cache contention: Shared cache lines causing interconnect traffic

NUMA-Aware Software Design

The Goal

The main goal of NUMA-aware software design is to minimize remote memory accesses and maximize local memory accesses. This involves:

  1. Memory placement: Allocating memory on the correct NUMA node
  2. CPU affinity: Binding threads to specific CPU cores
  3. Data distribution: Organizing data to minimize cross-node access

First-Touch Policy

Most operating systems use a first-touch policy for memory allocation:

  • Memory is allocated on the NUMA node where it's first accessed
  • The thread that first touches a memory page determines its location

Explicit Memory Placement

cpp
// Linux: Using numa_alloc_onnode
void* local_memory = numa_alloc_onnode(size, node_id);

// Windows: Using VirtualAllocExNuma
void* local_memory = VirtualAllocExNuma(
    GetCurrentProcess(), NULL, size,
    MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE, node_id
);

Thread Binding

Binding threads to specific CPU cores ensures they access local memory:

cpp
// Linux: Using pthread_setaffinity_np
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(core_id, &cpuset);
pthread_setaffinity_np(thread_id, sizeof(cpu_set_t), &cpuset);

// Windows: Using SetThreadAffinityMask
SetThreadAffinityMask(GetCurrentThread(), (1ULL << core_id));

NUMA-Aware Thread Creation

cpp
// Create threads on specific NUMA nodes
for (int node = 0; node < num_nodes; node++) {
    for (int core = 0; core < cores_per_node; core++) {
        int global_core = node * cores_per_node + core;
        create_thread_on_core(global_core);
    }
}

Partitioned Data Structures

cpp
// Each NUMA node owns a portion of the data
struct PartitionedArray {
    int* local_data[NUM_NODES];
    size_t local_size[NUM_NODES];
};

// Allocate each partition on its own node
// (elements_per_node is a placeholder for the per-node share)
for (int node = 0; node < NUM_NODES; node++) {
    array->local_size[node] = elements_per_node * sizeof(int);
    array->local_data[node] = (int*)numa_alloc_onnode(
        array->local_size[node], node
    );
}

NUMA-Aware Allocators

cpp
// Custom allocator that considers NUMA topology
class NumaAllocator {
public:
    void* allocate(size_t size) {
        int current_node = get_current_numa_node();
        return numa_alloc_onnode(size, current_node);
    }

    // numa_free needs the allocation size, so the caller passes it back
    void deallocate(void* ptr, size_t size) {
        numa_free(ptr, size);
    }
};

NUMA Detection and Analysis

Linux Tools

shell
# View NUMA topology
numactl --hardware

# Show NUMA statistics
cat /proc/zoneinfo | grep -E "Node|pages"

# Monitor local vs. remote memory accesses
perf stat -e node-loads,node-load-misses ./your_program

Key Metrics

  • NUMA hits: Local memory accesses
  • NUMA misses: Remote memory accesses
  • Memory bandwidth: Per-node bandwidth utilization
  • Interconnect traffic: Cross-node communication

Tools

  • Linux: perf, numastat, numad
  • Windows: Performance Monitor, Resource Monitor
  • Intel: VTune Profiler
  • AMD: AMD μProf

NUMA Optimization Techniques

NUMA-Aware Allocation

cpp
// Allocate memory on the current NUMA node
void* allocate_local(size_t size) {
    int node = numa_node_of_cpu(sched_getcpu());
    return numa_alloc_onnode(size, node);
}

// Allocate memory on a specific NUMA node
void* allocate_on_node(size_t size, int node) {
    return numa_alloc_onnode(size, node);
}

Memory Migration

cpp
// Move the page containing ptr to a different NUMA node.
// numa_move_pages works per page: ptr must be page-aligned, and a larger
// region needs one array entry per page.
int migrate_memory(void* ptr, int target_node) {
    int status = 0;  // receives the page's resulting node, or a negative errno
    return numa_move_pages(0 /* current process */, 1, &ptr,
                           &target_node, &status, 0);
}

NUMA-Aware Thread Creation

cpp
// Create threads spread across NUMA nodes
// (get_first_core_on_node and bind_thread_to_core are placeholder helpers:
// a topology lookup plus pthread_setaffinity_np, shown earlier)
void create_numa_threads(int num_threads) {
    for (int i = 0; i < num_threads; i++) {
        int node = i % numa_num_configured_nodes();
        int core = get_first_core_on_node(node);

        pthread_t thread;
        pthread_create(&thread, NULL, worker_function, NULL);
        bind_thread_to_core(thread, core);
    }
}

Work Distribution

cpp
// Distribute work based on NUMA topology
void distribute_work_numa(WorkItem* items, int num_items) {
    int num_nodes = numa_num_configured_nodes();
    int items_per_node = num_items / num_nodes;

    for (int node = 0; node < num_nodes; node++) {
        int start = node * items_per_node;
        int end = (node == num_nodes - 1) ? num_items : start + items_per_node;

        // Assign work to threads on this node
        assign_work_to_node(items + start, end - start, node);
    }
}

NUMA-Friendly Data Layout

cpp
// Structure of Arrays (SoA) for NUMA
struct NumaFriendlyData {
    // Each array allocated on a different NUMA node
    float* x[NUM_NODES];
    float* y[NUM_NODES];
    float* z[NUM_NODES];

    // Thread-local storage
    struct ThreadData {
        int node_id;
        int local_start;
        int local_end;
    } threads[MAX_THREADS];
};

Cache-Line Aware Padding

cpp
// Prevent false sharing across NUMA nodes
struct NumaAlignedData {
    alignas(64) int data[NUM_NODES][CACHE_LINE_SIZE / sizeof(int)];
};

Real-World Applications

High-Frequency Trading

In HFT systems, NUMA awareness is critical:

  • Latency sensitivity: Every nanosecond of memory access matters
  • Predictable performance: Consistent memory access patterns
  • Multi-threaded workloads: Market data processing across nodes

Database Systems

Database systems benefit from NUMA optimization:

  • Memory-bound workloads: Large datasets distributed across nodes
  • Concurrent access: Multiple threads accessing different data partitions
  • Buffer pool management: NUMA-aware buffer allocation

Scientific Computing

Scientific applications use NUMA effectively:

  • Large datasets: Distributed across multiple nodes
  • Parallel algorithms: Work distribution based on NUMA topology
  • Memory bandwidth: Maximizing local memory bandwidth

Common NUMA Pitfalls

1. Ignoring NUMA Topology

  • Problem: Not considering NUMA layout when designing software
  • Solution: Profile and optimize for NUMA characteristics

2. Poor Memory Placement

  • Problem: Memory allocated on wrong NUMA nodes
  • Solution: Use NUMA-aware allocators and explicit placement

3. Inefficient Thread Placement

  • Problem: Threads accessing remote memory
  • Solution: Bind threads to appropriate CPU cores

4. False Sharing Across Nodes

  • Problem: Cache lines shared between NUMA nodes
  • Solution: Use proper padding and data layout

Practical Guidelines

When to Optimize for NUMA

  • Multi-socket systems: Systems with multiple CPU sockets
  • Memory-bound applications: Applications limited by memory bandwidth
  • Latency-sensitive workloads: Where memory access time matters
  • Large datasets: Applications processing large amounts of data

When Not to Optimize for NUMA

  • Single-socket systems: No NUMA topology to optimize for
  • CPU-bound applications: Not limited by memory access
  • Small datasets: Overhead exceeds benefits
  • Simple applications: Where optimization complexity isn't justified

Measurement

Always measure NUMA performance before and after optimization:

cpp
#include <chrono>
#include <iostream>
#include <numa.h>
#include <sched.h>

// Measure memory access performance
auto start = std::chrono::high_resolution_clock::now();
// ... memory access code ...
auto end = std::chrono::high_resolution_clock::now();

auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
std::cout << "Memory access time: " << duration.count() << " nanoseconds\n";

// Check which NUMA node the calling thread is currently running on
std::cout << "Running on NUMA node: " << numa_node_of_cpu(sched_getcpu()) << "\n";

Understanding NUMA architecture and implementing NUMA-aware software design is essential for achieving optimal performance on modern multi-socket systems, especially in latency-sensitive applications like high-frequency trading.

Questions

Q: What does NUMA stand for?

NUMA stands for Non-Uniform Memory Access. It describes a computer memory design where memory access times depend on the memory location relative to the processor. Memory that is local to a processor can be accessed faster than memory that is remote.

Q: What is a NUMA node?

A NUMA node is a group of processors (CPUs) and their local memory that can be accessed with uniform latency. Each NUMA node contains one or more CPU sockets, their associated cores, and the memory directly connected to those sockets.

Q: What is the main performance issue with NUMA systems?

The main performance issue with NUMA systems is that memory access latency varies depending on whether the memory is local or remote to the processor. Local memory access is much faster than remote memory access, which can cause significant performance degradation if not managed properly.

Q: What is local memory in a NUMA system?

Local memory in a NUMA system is memory that is directly connected to a specific processor socket. This memory can be accessed with the lowest latency by the processors in that NUMA node.

Q: What is remote memory in a NUMA system?

Remote memory in a NUMA system is memory that belongs to a different NUMA node. Accessing remote memory requires going through the interconnect between nodes, which adds significant latency compared to local memory access.

Q: What is CPU affinity in NUMA-aware programming?

CPU affinity in NUMA-aware programming is binding threads to specific CPU cores to control memory access patterns. This ensures that threads access memory from their local NUMA node, reducing latency and improving performance.

Q: What is memory placement in NUMA systems?

Memory placement in NUMA systems is allocating memory from specific NUMA nodes to optimize access patterns. This involves ensuring that data is allocated on the same NUMA node as the threads that will access it most frequently.

Q: What is the typical latency difference between local and remote memory access?

Remote memory access is typically 2-3x slower than local memory access in NUMA systems. This significant latency difference makes NUMA-aware programming crucial for high-performance applications.

Q: What is a NUMA-aware allocator?

A NUMA-aware allocator is an allocator that considers NUMA topology when allocating memory. It tries to allocate memory from the NUMA node that is local to the thread making the allocation request, reducing memory access latency.

Q: What is the main goal of NUMA-aware software design?

The main goal of NUMA-aware software design is to minimize remote memory accesses and maximize local memory accesses. This reduces memory access latency and improves overall application performance.