Memory Ordering və Consistency
Memory Ordering Nədir?
Memory Ordering - müxtəlif thread-lərin memory əməliyyatlarının hansı sıra ilə görünməsini müəyyənləşdirir.
Problem: Reordering
Modern CPU-lar və compiler-lər performans üçün təlimatları yenidən sıralaya bilər.
// Source code
int data = 0;
int flag = 0;
// Thread 1
data = 42; // Write 1
flag = 1; // Write 2
// Thread 2
if (flag == 1) { // Read 2
assert(data == 42); // Read 1 - Təmin edərmi?
}
sequenceDiagram
participant T1 as Thread 1
participant Mem as Memory
participant T2 as Thread 2
Note over T1: CPU reorder edə bilər!
T1->>Mem: flag = 1 (Write 2 first!)
T2->>Mem: Read flag (1)
T2->>Mem: Read data (0 - köhnə!)
T1->>Mem: data = 42 (Write 1 late)
Note over T2: Assert fails!
Memory Reordering Növləri
1. Compiler Reordering
Compiler optimization zamanı.
// Original
x = 1;
y = 2;
// Compiler reorder edə bilər
y = 2;
x = 1;
2. CPU Reordering
Hardware səviyyəsində.
Store Buffer:
- CPU write-ı buffer-ə yazır
- Sonra memory-ə flush edir
- Digər CPU-lar gecikmə ilə görür
Load/Store Reordering:
- Load-lar store-dan qabaq icra oluna bilər
- Independent store-lar reorder oluna bilər
3. Cache Coherence Delays
Multi-core sistemdə cache sync gecikirdi.
graph LR
CPU1[CPU 1] --> Cache1[Cache 1]
CPU2[CPU 2] --> Cache2[Cache 2]
Cache1 -->|Delayed| Memory[Shared Memory]
Cache2 -->|Delayed| Memory
Memory --> Cache2
Memory Consistency Models
Sequential Consistency (SC)
Ən güclü model - bütün əməliyyatlar proqram sırasında görünür.
graph LR
T1_W1[T1: x=1] --> T1_W2[T1: y=2]
T2_R1[T2: r1=y] --> T2_R2[T2: r2=x]
T1_W2 -.->|happens-before| T2_R1
Qaydalar:
- Hər thread-in əməliyyatları proqram sırasında
- Global total order var
- Load dərhal ən son write-ı görür
Çatışmazlıq: Yavaşdır (az optimization)
Total Store Ordering (TSO)
x86/x86-64 modeli
İcazə verilir:
- Store → Load reordering
- Load → Load reordering (eyni address-dən başqa)
- Load → Store reordering
İcazə verilmir:
- Store → Store reordering (eyni address-ə)
- Load → Load reordering (eyni address-ə)
graph TB
subgraph "TSO - x86"
S1[Store] --> S2[Store]
L1[Load] --> L2[Load]
S1 -.->|Can reorder| L1
end
Weak Ordering / Relaxed Consistency
ARM, PowerPC, RISC-V
Demək olar ki, hər şey reorder ola bilər!
graph TB
subgraph "Weak Ordering"
S1[Store 1]
S2[Store 2]
L1[Load 1]
L2[Load 2]
S1 -.-> S2
S1 -.-> L1
S1 -.-> L2
S2 -.-> L1
L1 -.-> L2
end
Note[All can be reordered!]
Üstünlük: Maksimum performans Çatışmazlıq: Programmer-ə çətin, explicit barrier lazımdır
Memory Barriers (Fences)
Memory barrier - CPU-ya təlimatları reorder etməməsini bildirir.
Barrier Növləri
1. Full Memory Barrier (mfence)
Bütün memory əməliyyatlarını sinxronlaşdırır.
// x86
store(x, 1);
mfence(); // Full barrier
store(y, 1);
sequenceDiagram
participant Before
participant Barrier
participant After
Before->>Barrier: All stores before
Barrier->>After: Complete before continuing
After->>After: All stores after
2. Store Barrier (sfence)
Yalnız store əməliyyatları.
// x86
store(x, 1);
store(y, 2);
sfence(); // Store barrier
store(z, 3);
Təmin edir: x və y, z-dən əvvəl görünür.
3. Load Barrier (lfence)
Yalnız load əməliyyatları.
// x86
load(a);
load(b);
lfence(); // Load barrier
load(c);
4. Acquire Barrier
Load-dan sonrakı əməliyyatları bloklar.
// Load with acquire
int value = atomic_load_acquire(&flag);
// Sonrakı loads/stores reorder olmaz
graph LR
Load[Load Acquire] --> Barrier[|||]
Barrier --> After[Later Operations]
Note[Cannot move after<br/>operations before barrier]
5. Release Barrier
Store-dan əvvəlki əməliyyatları bloklar.
// Store with release
atomic_store_release(&flag, 1);
graph LR
Before[Earlier Operations] --> Barrier[|||]
Barrier --> Store[Store Release]
Note[Cannot move before<br/>operations after barrier]
Store Buffer və Load Buffer
Store Buffer
CPU store-ları buffer-ə yazır, sonra memory-ə flush edir.
graph LR
CPU[CPU Core] --> SB[Store Buffer]
SB --> Cache[Cache]
Cache --> Memory[Main Memory]
SB -.->|Bypass| CPU
Store Forwarding: CPU öz store buffer-indən oxuya bilir.
Nümunə:
// Thread 1
x = 1; // Store buffer-ə gedir
r1 = x; // Store forwarding - 1 oxuyur
Load Buffer
Speculative load-lar üçün.
graph LR
CPU[CPU Core] --> LB[Load Buffer]
LB --> Cache[Cache]
LB -.->|Speculative| CPU
C/C++ Memory Order
C11 və C++11 atomic operations.
memory_order_relaxed
Heç bir sinxronizasiya yoxdur.
atomic<int> x(0);
x.store(1, memory_order_relaxed); // Reorder oluna bilər
memory_order_acquire / memory_order_release
Acquire-release semantics.
// Producer (Thread 1)
data = 42;
flag.store(1, memory_order_release); // Release barrier
// Consumer (Thread 2)
if (flag.load(memory_order_acquire) == 1) { // Acquire barrier
assert(data == 42); // OK!
}
sequenceDiagram
participant T1 as Producer
participant Mem as Memory
participant T2 as Consumer
T1->>Mem: data = 42
T1->>Mem: flag = 1 (release)
Note over T1,Mem: Release ensures<br/>data visible before flag
T2->>Mem: flag = 1? (acquire)
Note over Mem,T2: Acquire ensures<br/>flag visible before data read
T2->>Mem: read data (42) ✓
memory_order_seq_cst
Sequential consistency (default).
atomic<int> x(0);
x.store(1); // memory_order_seq_cst (default)
Ən güclü, amma ən yavaş.
memory_order_consume
Deprecated, acquire kimi.
memory_order_acq_rel
Load-acquire və store-release.
x.fetch_add(1, memory_order_acq_rel);
Architecture-Specific Barriers
x86/x86-64
mfence ; Full memory barrier
sfence ; Store fence
lfence ; Load fence
lock ; Prefix for atomic operations
x86 TSO model-i strong-dur, az barrier lazımdır.
ARM
dmb ; Data Memory Barrier
dsb ; Data Synchronization Barrier
isb ; Instruction Synchronization Barrier
; Variants
dmb sy ; Full system barrier
dmb ish ; Inner shareable
dmb st ; Store only
ARM64
; Load-acquire / Store-release instructions
ldar x0, [x1] ; Load-acquire
stlr x0, [x1] ; Store-release
RISC-V
fence ; Memory fence
fence.i ; Instruction fence
; Fine-grained
fence w,r ; Write → Read fence
fence r,w ; Read → Write fence
Volatile Keyword
C/C++ volatile
volatile int flag = 0;
// Compiler reordering qarşısını almır
// Hardware reordering qarşısını almır
// Yalnız optimization qarşısını alır
Nə edir:
- Compiler cache-də saxlamır
- Hər dəfə memory-dən oxuyur
- Optimization qarşısını alır
Nə etmir:
- Memory ordering təmin etmir
- Atomicity təmin etmir
- Multi-threading üçün kifayət deyil
Java volatile
Java-da daha güclüdür!
volatile int flag = 0;
// Acquire-release semantics
// Happens-before relationship
// Visibility guarantee
Java volatile = C++ atomic with seq_cst
Praktik Nümunələr
Double-Checked Locking (Broken without barriers)
// BROKEN!
Singleton* Singleton::getInstance() {
if (instance == nullptr) { // Read 1
lock();
if (instance == nullptr) { // Read 2
instance = new Singleton(); // Write
}
unlock();
}
return instance;
}
Problem: new üç addım:
- Memory allocate
- Constructor call
- Pointer assign
CPU reorder edə bilər: 1 → 3 → 2
Həll:
// C++11
std::atomic<Singleton*> instance;
Singleton* Singleton::getInstance() {
Singleton* tmp = instance.load(memory_order_acquire);
if (tmp == nullptr) {
lock();
tmp = instance.load(memory_order_relaxed);
if (tmp == nullptr) {
tmp = new Singleton();
instance.store(tmp, memory_order_release);
}
unlock();
}
return tmp;
}
Dekker's Algorithm (Mutual Exclusion)
// Shared
int flag[2] = {0, 0};
int turn = 0;
// Thread 0
void thread0() {
flag[0] = 1; // I want to enter
while (flag[1]) { // Other wants to enter?
if (turn != 0) {
flag[0] = 0; // Back off
while (turn != 0); // Wait
flag[0] = 1; // Try again
}
}
// Critical section
turn = 1; // Give turn
flag[0] = 0; // Exit
}
Weak model-də barrier lazımdır!
Peterson's Algorithm
// Shared
atomic<int> flag[2];
atomic<int> turn;
// Thread 0
flag[0].store(1, memory_order_release);
turn.store(1, memory_order_release);
while (flag[1].load(memory_order_acquire) &&
turn.load(memory_order_acquire) == 1);
// Critical section
flag[0].store(0, memory_order_release);
Performance Impact
Barrier Cost
| Architecture | Full Fence Cost |
|---|---|
| x86-64 | ~20-30 cycles |
| ARM | ~40-50 cycles |
| PowerPC | ~50-100 cycles |
Model Comparison
graph LR
SC[Sequential<br/>Consistency] --> TSO[TSO<br/>x86]
TSO --> RMO[Relaxed<br/>ARM/PowerPC]
SC -.->|Strongest| Note1[Slowest]
RMO -.->|Weakest| Note2[Fastest]
Best Practices
1. Use High-Level Primitives
// GOOD: Use standard library
std::mutex m;
std::atomic<int> counter;
// BAD: Manual barriers (unless necessary)
__sync_synchronize();
2. Minimize Synchronization
// BAD: Excessive sync
for (int i = 0; i < N; i++) {
atomic_increment(&counter); // Barrier per iteration!
}
// GOOD: Local accumulation
int local = 0;
for (int i = 0; i < N; i++) {
local++;
}
atomic_add(&counter, local); // One barrier
3. Acquire-Release Pattern
// Producer
data.store(value, memory_order_relaxed);
ready.store(true, memory_order_release);
// Consumer
while (!ready.load(memory_order_acquire));
auto value = data.load(memory_order_relaxed);
4. Document Assumptions
// This code assumes x86 TSO model!
// On ARM, explicit barriers needed!
store(x, 1);
store(y, 1); // Visible in order on x86
Debugging Memory Ordering
Tools
ThreadSanitizer (TSan):
g++ -fsanitize=thread -g program.cpp
./a.out
Relacy Race Detector:
- Systematic testing
- Simulates all possible reorderings
Spin / Promela:
- Model checking
- Exhaustive state exploration
Common Bugs
- Missing barriers
- Wrong memory order
- Assuming sequential consistency
- Forgetting volatile ≠ atomic
Əlaqəli Mövzular
- Multiprocessor Systems: Cache coherence və memory ordering
- Synchronization: Lock-free data structures
- Cache Memory: Cache coherence protocols
- Performance: Barrier overhead