Memory Ordering

Memory ordering controls how atomic operations are synchronized across threads and how the compiler/hardware can reorder instructions.

The Problem: Instruction Reordering

Modern CPUs and compilers don't execute instructions in the exact order you write them; they reorder operations for performance. In addition, threads may run on separate cores, each with its own L1/L2 caches. Programmers therefore need to understand how to commit changes to memory and load values from memory in a way that is safe for multi-threaded programs. On some processor architectures, one core can even load a stale value while another core is writing to the same variable.

Why Reordering Happens

cpp
What you write:          What CPU might do:
int x = 1;              int y = 2;     // Reordered!
int y = 2;              int x = 1;

In single-threaded code: This reordering is invisible and safe. In multi-threaded code: This can cause serious bugs!

The Solution: Memory Fences

C++ lets you place "fences" around operations to ensure that changes to shared variables are committed to memory and visible to other threads before those threads load them. This is done using the std::memory_order enum.

Memory Ordering Options

The following are the different memory ordering options:

1. Relaxed Ordering (std::memory_order_relaxed)

No ordering guarantees. The operation itself is still atomic (no torn reads or writes), but the compiler and hardware may freely reorder it relative to surrounding operations, and a thread may observe stale values of other variables written by another thread. Diagram:

cpp
Thread 1:               Thread 2:
┌─────────────────┐     ┌─────────────────┐
│ x = 1;          │     │ y = 2;          │
│ y = 2;          │     │ x = 1;          │
│ relaxed store;  │     │ relaxed load;   │
└─────────────────┘     └─────────────────┘
     ↕️ No fences           ↕️ No guarantees
     Any order possible     May see stale data

Use when: Simple counters where order doesn't matter.

2. Acquire Ordering (std::memory_order_acquire)

Read fence - ensures that no memory operation in the current thread can be reordered before the acquire. When the acquire load reads a value written by a release store in another thread, all writes made by that thread before its release become visible to the current thread. Diagram:

cpp
Thread A (Producer):     Thread B (Consumer):
┌─────────────────┐     ┌─────────────────┐
│ data = 42;      │     │                 │
│ flag = true;    │     │                 │
│ RELEASE fence   │     │                 │
└─────────────────┘     └─────────────────┘
         │                       │
         │                       ▼
         │               ┌─────────────────┐
         │               │ ACQUIRE fence   │ ← FENCE HERE
         │               │ load flag;      │
         │               │ use data;       │ ← Guaranteed to see data = 42
         │               └─────────────────┘
         │                       │
         ▼                       ▼
    All writes before         All writes from
    fence are committed       producer are visible

Use when: Reading shared data that was written by another thread.

3. Release Ordering (std::memory_order_release)

Write fence - ensures that no memory operation in the current thread can be reordered after the release. All stores made before the release are committed and become visible to any thread that performs an acquire on the same variable. Diagram:

cpp
Thread A (Producer):     Thread B (Consumer):
┌─────────────────┐     ┌─────────────────┐
│ data = 42;      │     │                 │
│ flag = true;    │     │                 │
│ RELEASE fence   │ ← FENCE HERE          │
│ store flag;     │     │                 │
└─────────────────┘     └─────────────────┘
         │                       │
         ▼                       │
    All writes before            │
    fence are committed          │
    to memory                    │
         │                       │
         │                       ▼
         │               ┌─────────────────┐
         │               │ load flag;      │
         │               │ use data;       │
         │               └─────────────────┘
         │                       │
         ▼                       ▼
    Producer commits         Consumer can
    all previous writes      see all writes

Use when: Writing data that will be read by another thread.

4. Acquire-Release (std::memory_order_acq_rel)

Combination of acquire and release - used for read-modify-write operations. The load half acts as an acquire (writes released by other threads become visible to this thread), and the store half acts as a release (this thread's earlier writes become visible to threads that later acquire). Diagram:

cpp
Thread A:                 Thread B:
┌─────────────────┐      ┌─────────────────┐
│                 │      │                 │
│                 │      │                 │
└─────────────────┘      └─────────────────┘
         │                        │
         ▼                        ▼
┌─────────────────┐      ┌─────────────────┐
│ ACQ_REL fence   │ ← Both fences here    │
│ fetch_add(1);   │      │                 │
│                 │      │                 │
└─────────────────┘      └─────────────────┘
         │                        │
         ▼                        ▼
    All writes before         All writes from
    fence are visible         this thread are
    to this thread            visible to others

Use when: Read-modify-write operations that need both fences.

5. Sequential Consistency (std::memory_order_seq_cst)

Strongest guarantee - all seq_cst operations appear in a single total order that every thread observes identically. This is the strongest memory ordering guarantee and the default for all atomic operations! Diagram:

cpp
Global Timeline (all threads see the same order):

Thread 1:     Thread 2:     Thread 3:
┌─────────┐   ┌─────────┐   ┌─────────┐
│ x = 1   │   │ y = 2   │   │ z = 3   │
│ y = 4   │   │ x = 5   │   │ x = 6   │
│ z = 7   │   │ z = 8   │   │ y = 9   │
└─────────┘   └─────────┘   └─────────┘
     │             │             │
     └─────────────┼─────────────┘

            Global Order:
            [x=1, y=2, z=3, y=4, x=5, x=6, z=7, z=8, y=9]
            All threads see this exact sequence!

Intuitive Understanding

It is intuitive to think of release as "commit to memory": the writer thread broadcasts its changes to all other threads. Acquire is the opposite: the reader thread ensures that any commits made before the matching release are visible to it. The cost of an operation increases with the strength of its memory ordering guarantee.

Specifying memory orders also helps readers reason about how a variable is used. For example, if a thread never writes to a variable, it should never need std::memory_order_release or std::memory_order_acq_rel. Using these orders unnecessarily slows the program down and makes it harder to reason about.

Memory ordering can be specified as the second argument to any atomic operation. For example, x.store(1, std::memory_order_release); ensures that all stores made before it are visible to other threads that perform an acquire on x. Alternatively, you can place a standalone fence with the std::atomic_thread_fence function.

Practical Examples

1. Using Acquire-Release

cpp
std::atomic<int> flag{0};
int data = 0;

// Thread A - Producer
void producer() {
    data = 42;  // Prepare data
    flag.store(1, std::memory_order_release);  // Release store, COMMIT EVERYTHING BEFORE THIS POINT TO MEMORY
}

// Thread B - Consumer  
void consumer() {
    int f = flag.load(std::memory_order_acquire);  // Acquire load, ENSURE EVERYTHING BEFORE THIS POINT IS VISIBLE TO THIS THREAD
    if (f == 1) {
        // data is guaranteed to be 42 here
        use_data(data);
    }
}

Properties:

  • Release: All previous writes are visible to threads that acquire this value
  • Acquire: All previous writes from the releasing thread are visible
  • Cheaper: Less expensive than sequential consistency
  • Common pattern: Producer-consumer relationships

2. Using Relaxed

You can use std::memory_order_relaxed to increment a counter variable. The fetch_add operation is an atomic read-modify-write, so every increment is counted exactly once; relaxed only means the increment is not ordered relative to surrounding operations on other variables.

In the below example, the ordering of +1 operations doesn't matter as long as the number of +1 operations is the same.

cpp
std::atomic<int> counter{0};

void increment() {
    counter.fetch_add(1, std::memory_order_relaxed);
}

Properties:

  • Weakest ordering: No synchronization guarantees
  • Fastest: Minimal overhead
  • Use case: Simple counters where order doesn't matter
  • Dangerous: Easy to create bugs

Common Patterns

1. Double-Checked Locking

cpp
std::atomic<bool> initialized{false};
std::mutex init_mutex;
int* singleton = nullptr;

int* get_singleton() {
    if (!initialized.load(std::memory_order_acquire)) {
        std::lock_guard<std::mutex> lock(init_mutex);
        if (!initialized.load(std::memory_order_relaxed)) {
            singleton = new int(42);
            initialized.store(true, std::memory_order_release);
        }
    }
    return singleton;
}

2. Lock-Free Counter

cpp
std::atomic<int> counter{0};

void increment() {
    counter.fetch_add(1, std::memory_order_relaxed);
}

int get_count() {
    return counter.load(std::memory_order_acquire);
}

3. Producer-Consumer with Atomics

cpp
std::atomic<int> head{0}, tail{0};
int buffer[1000];

// Single-producer / single-consumer sketch
void producer() {
    int pos = head.load(std::memory_order_relaxed);
    buffer[pos % 1000] = 42;                           // Write the data first...
    head.store(pos + 1, std::memory_order_release);    // ...then publish it
}

void consumer() {
    int pos = tail.load(std::memory_order_relaxed);
    if (pos < head.load(std::memory_order_acquire)) {  // Pairs with the release above
        int value = buffer[pos % 1000];                // Guaranteed to see the data
        tail.store(pos + 1, std::memory_order_release);
        // ... use value ...
    }
}

Performance Impact

Memory Ordering Costs

Ordering           Cost (approx.)   Use Case
relaxed            ~1 ns            Simple counters, flags
acquire/release    ~5 ns            Synchronization patterns
seq_cst            ~20 ns           When you need total ordering

(Costs are illustrative and vary widely by architecture.)

When to Use Each

relaxed: When order doesn't matter

cpp
std::atomic<int> counter{0};
counter.fetch_add(1, std::memory_order_relaxed);

acquire/release: For synchronization

cpp
std::atomic<bool> ready{false};
ready.store(true, std::memory_order_release);
if (ready.load(std::memory_order_acquire)) { ... }

seq_cst: When you need total ordering

cpp
std::atomic<int> x{0}, y{0};
x.store(1, std::memory_order_seq_cst);
y.store(1, std::memory_order_seq_cst);

Debugging Memory Ordering

1. Use Tools

  • ThreadSanitizer: Detects data races
  • CppMem: Models memory ordering
  • Relacy: Race condition detection

2. Test on Weak Memory Model Architectures

  • ARM: More reordering than x86
  • PowerPC: Different memory model
  • MIPS: Another weak memory model

3. Think in Terms of Happens-Before

  • Happens-before: Formal relationship between operations
  • Synchronizes-with: Release-acquire relationships
  • Carries-a-dependency: Data dependencies

Questions

Q: What memory order is required if multiple threads need to increment a counter?

For simple counter increments, relaxed memory ordering is sufficient because the order of increments doesn't matter - only atomicity is required. Sequential consistency would work but is unnecessarily expensive.

Q: Which of the following memory orders ensures that writes are visible after the atomic operation?

std::memory_order_release ensures that all previous writes are visible to threads that perform an acquire operation on the same atomic variable. This is the key property for producer-consumer patterns.

Q: What happens if you mix relaxed and sequential consistency memory orders?

Mixing different memory orders can lead to subtle bugs because the synchronization guarantees become unclear. It's best to use consistent memory ordering within a synchronization pattern.

Q: Which memory order is the default for std::atomic operations?

std::memory_order_seq_cst is the default memory order for std::atomic operations. It provides the strongest guarantees but is also the most expensive.

Q: When should you use acquire memory order?

Acquire memory order is used when you need to ensure that all previous writes from a thread that performed a release operation are visible. This is commonly used in consumer threads.

Q: What is the performance difference between relaxed and sequential consistency?

Relaxed memory ordering can be substantially faster than sequential consistency (roughly 20x in the table above) because it avoids memory barriers and cross-thread synchronization; the exact difference depends on the architecture.