Memory Ordering

Memory ordering controls how atomic operations are synchronized across threads and how the compiler/hardware can reorder instructions.

The Problem: Instruction Reordering

Modern CPUs and compilers don't execute instructions in the exact order you write them; they reorder operations for performance. In addition, threads may run on separate cores, each with its own L1/L2 caches. Programmers therefore need to understand how to commit changes to memory and load values from memory in a way that is safe for multi-threaded programs. On some processor architectures, one core can even load a stale value while another core is writing to the same variable.

Why Reordering Happens

cpp
What you write:          What CPU might do:
int x = 1;              int y = 2;     // Reordered!
int y = 2;              int x = 1;

In single-threaded code: This reordering is invisible and safe. In multi-threaded code: This can cause serious bugs!

The Solution: Memory Fences

C++ lets you place "fences" around operations to ensure that changes to shared variables are committed to memory and visible to other threads before those threads load them. This is done using the std::memory_order enum.

Memory Ordering Options

The following are the different memory ordering options:

1. Relaxed Ordering (std::memory_order_relaxed)

No ordering guarantees. The operation itself is still atomic (no torn reads or writes), but the compiler and hardware may freely reorder it relative to surrounding operations, and a thread may observe stale values of other variables written by another thread. Diagram:

cpp
Thread 1:               Thread 2:
┌─────────────────┐     ┌─────────────────┐
│ x = 1;          │     │ y = 2;          │
│ y = 2;          │     │ x = 1;          │
│ relaxed store;  │     │ relaxed load;   │
└─────────────────┘     └─────────────────┘
     ↕️ No fences           ↕️ No guarantees
     Any order possible     May see stale data

Use when: Simple counters where order doesn't matter.

2. Acquire Ordering (std::memory_order_acquire)

Read fence - ensures that no memory operation in the current thread can be reordered before the acquire. When the acquire load reads a value written by a release store in another thread, all writes made by that thread before its release become visible to the current thread. Diagram:

cpp
Thread A (Producer):     Thread B (Consumer):
┌─────────────────┐     ┌─────────────────┐
│ data = 42;      │     │                 │
│ flag = true;    │     │                 │
│ RELEASE fence   │     │                 │
└─────────────────┘     └─────────────────┘
         │                       │
         │                       ▼
         │               ┌─────────────────┐
         │               │ ACQUIRE fence   │ ← FENCE HERE
         │               │ load flag;      │
         │               │ use data;       │ ← Guaranteed to see data = 42
         │               └─────────────────┘
         │                       │
         ▼                       ▼
    All writes before         All writes from
    fence are committed       producer are visible

Use when: Reading shared data that was written by another thread.

3. Release Ordering (std::memory_order_release)

Write fence - ensures that no memory operation in the current thread can be reordered after the release. All stores made before the release are committed and become visible to any thread that performs an acquire on the same variable. Diagram:

cpp
Thread A (Producer):     Thread B (Consumer):
┌─────────────────┐     ┌─────────────────┐
│ data = 42;      │     │                 │
│ flag = true;    │     │                 │
│ RELEASE fence   │ ← FENCE HERE          │
│ store flag;     │     │                 │
└─────────────────┘     └─────────────────┘
         │                       │
         ▼                       │
    All writes before            │
    fence are committed          │
    to memory                    │
         │                       │
         │                       ▼
         │               ┌─────────────────┐
         │               │ load flag;      │
         │               │ use data;       │
         │               └─────────────────┘
         │                       │
         ▼                       ▼
    Producer commits         Consumer can
    all previous writes      see all writes

Use when: Writing data that will be read by another thread.

4. Acquire-Release (std::memory_order_acq_rel)

Combination of acquire and release - used for read-modify-write operations. The load half acts as an acquire (writes released by other threads become visible to this thread), and the store half acts as a release (this thread's earlier writes become visible to threads that later acquire). Diagram:

cpp
Thread A:                 Thread B:
┌─────────────────┐      ┌─────────────────┐
│                 │      │                 │
│                 │      │                 │
└─────────────────┘      └─────────────────┘
         │                        │
         ▼                        ▼
┌─────────────────┐      ┌─────────────────┐
│ ACQ_REL fence   │ ← Both fences here    │
│ fetch_add(1);   │      │                 │
│                 │      │                 │
└─────────────────┘      └─────────────────┘
         │                        │
         ▼                        ▼
    All writes before         All writes from
    fence are visible         this thread are
    to this thread            visible to others

Use when: Read-modify-write operations that need both fences.

5. Sequential Consistency (std::memory_order_seq_cst)

Strongest guarantee - all seq_cst operations appear in a single total order that every thread observes identically. This is the strongest memory ordering guarantee and the default for all atomic operations! Diagram:

cpp
Global Timeline (all threads see the same order):

Thread 1:     Thread 2:     Thread 3:
┌─────────┐   ┌─────────┐   ┌─────────┐
│ x = 1   │   │ y = 2   │   │ z = 3   │
│ y = 4   │   │ x = 5   │   │ x = 6   │
│ z = 7   │   │ z = 8   │   │ y = 9   │
└─────────┘   └─────────┘   └─────────┘
     │             │             │
     └─────────────┼─────────────┘

            Global Order:
            [x=1, y=2, z=3, y=4, x=5, x=6, z=7, z=8, y=9]
            All threads see this exact sequence!

Intuitive Understanding

It is intuitive to think of release as "commit to memory": the writer thread broadcasts its changes to all other threads. Acquire is the opposite: the reader thread ensures that any commits made before the matching release are visible to it. The cost of an operation increases with the strength of its memory ordering guarantee.

Specifying memory orders also helps readers reason about how a variable is used. For example, if a thread never writes to a variable, it should never need std::memory_order_release or std::memory_order_acq_rel. Using these orders unnecessarily slows the program down and makes it harder to reason about.

Memory ordering can be specified as the second argument to any atomic operation. For example, x.store(1, std::memory_order_release); ensures that all stores made before it are visible to other threads that perform an acquire on x. Alternatively, you can place a standalone fence with the std::atomic_thread_fence function.

Practical Examples

1. Using Acquire-Release

cpp
std::atomic<int> flag{0};
int data = 0;

// Thread A - Producer
void producer() {
    data = 42;  // Prepare data
    flag.store(1, std::memory_order_release);  // Release store, COMMIT EVERYTHING BEFORE THIS POINT TO MEMORY
}

// Thread B - Consumer  
void consumer() {
    int f = flag.load(std::memory_order_acquire);  // Acquire load, ENSURE EVERYTHING BEFORE THIS POINT IS VISIBLE TO THIS THREAD
    if (f == 1) {
        // data is guaranteed to be 42 here
        use_data(data);
    }
}

Properties:

  • Release: All previous writes are visible to threads that acquire this value
  • Acquire: All previous writes from the releasing thread are visible
  • Cheaper: Less expensive than sequential consistency
  • Common pattern: Producer-consumer relationships

2. Using Relaxed

You can use std::memory_order_relaxed to increment a counter variable. The fetch_add operation is an atomic read-modify-write, so every increment is counted exactly once; relaxed only means the increment is not ordered relative to surrounding operations on other variables.

In the below example, the ordering of +1 operations doesn't matter as long as the number of +1 operations is the same.

cpp
std::atomic<int> counter{0};

void increment() {
    counter.fetch_add(1, std::memory_order_relaxed);
}

Properties:

  • Weakest ordering: No synchronization guarantees
  • Fastest: Minimal overhead
  • Use case: Simple counters where order doesn't matter
  • Dangerous: Easy to create bugs

Common Patterns

1. Double-Checked Locking

cpp
std::atomic<bool> initialized{false};
std::mutex init_mutex;
int* singleton = nullptr;

int* get_singleton() {
    if (!initialized.load(std::memory_order_acquire)) {
        std::lock_guard<std::mutex> lock(init_mutex);
        if (!initialized.load(std::memory_order_relaxed)) {
            singleton = new int(42);
            initialized.store(true, std::memory_order_release);
        }
    }
    return singleton;
}

2. Lock-Free Counter

cpp
std::atomic<int> counter{0};

void increment() {
    counter.fetch_add(1, std::memory_order_relaxed);
}

int get_count() {
    return counter.load(std::memory_order_acquire);
}

3. Producer-Consumer with Atomics

cpp
std::atomic<int> head{0}, tail{0};
int buffer[1000];

// Single-producer / single-consumer sketch
void producer() {
    int pos = head.load(std::memory_order_relaxed);
    buffer[pos % 1000] = 42;                           // Write the data first...
    head.store(pos + 1, std::memory_order_release);    // ...then publish it
}

void consumer() {
    int pos = tail.load(std::memory_order_relaxed);
    if (pos < head.load(std::memory_order_acquire)) {  // Pairs with the release above
        int value = buffer[pos % 1000];                // Guaranteed to see the data
        tail.store(pos + 1, std::memory_order_release);
        // ... use value ...
    }
}

Performance Impact

Memory Ordering Costs

Ordering           Cost (approx.)   Use Case
relaxed            ~1 ns            Simple counters, flags
acquire/release    ~5 ns            Synchronization patterns
seq_cst            ~20 ns           When you need total ordering

(Costs are illustrative and vary widely by architecture.)

When to Use Each

relaxed: When order doesn't matter

cpp
std::atomic<int> counter{0};
counter.fetch_add(1, std::memory_order_relaxed);

acquire/release: For synchronization

cpp
std::atomic<bool> ready{false};
ready.store(true, std::memory_order_release);
if (ready.load(std::memory_order_acquire)) { ... }

seq_cst: When you need total ordering

cpp
std::atomic<int> x{0}, y{0};
x.store(1, std::memory_order_seq_cst);
y.store(1, std::memory_order_seq_cst);

Debugging Memory Ordering

1. Use Tools

  • ThreadSanitizer: Detects data races
  • CppMem: Models memory ordering
  • Relacy: Race condition detection

2. Test on Weak Memory Model Architectures

  • ARM: More reordering than x86
  • PowerPC: Different memory model
  • MIPS: Another weak memory model

3. Think in Terms of Happens-Before

  • Happens-before: Formal relationship between operations
  • Synchronizes-with: Release-acquire relationships
  • Carries-a-dependency: Data dependencies

Questions

Q: What memory order is required if multiple threads need to increment a counter?

For simple counter increments, relaxed memory ordering is sufficient because the order of increments doesn't matter - only atomicity is required. Sequential consistency would work but is unnecessarily expensive.

Q: Which of the following memory orders ensures that writes are visible after the atomic operation?

std::memory_order_release ensures that all previous writes are visible to threads that perform an acquire operation on the same atomic variable. This is the key property for producer-consumer patterns.

Q: What happens if you mix relaxed and sequential consistency memory orders?

Mixing different memory orders can lead to subtle bugs because the synchronization guarantees become unclear. It's best to use consistent memory ordering within a synchronization pattern.

Q: Which memory order is the default for std::atomic operations?

std::memory_order_seq_cst is the default memory order for std::atomic operations. It provides the strongest guarantees but is also the most expensive.

Q: When should you use acquire memory order?

Acquire memory order is used when you need to ensure that all previous writes from a thread that performed a release operation are visible. This is commonly used in consumer threads.

Q: What is the performance difference between relaxed and sequential consistency?

Relaxed memory ordering can be substantially faster than sequential consistency (roughly 20x in the table above) because it avoids memory barriers and cross-thread synchronization; the exact difference depends on the architecture.