Multiprocessor Systems
What Is a Multiprocessor?
A multiprocessor is a system in which several processors operate within the same machine, increasing performance and throughput.
graph TB
MP[Multiprocessor Systems]
MP --> SM[Shared Memory]
MP --> DM[Distributed Memory]
SM --> UMA[UMA<br/>Uniform Memory Access]
SM --> NUMA[NUMA<br/>Non-Uniform Memory Access]
DM --> Cluster[Cluster]
DM --> Grid[Grid Computing]
Shared Memory vs Distributed Memory
Shared Memory
All processors access the same memory space.
graph TB
subgraph "Shared Memory"
P1[Processor 1]
P2[Processor 2]
P3[Processor 3]
P4[Processor 4]
P1 --> Bus[Interconnect]
P2 --> Bus
P3 --> Bus
P4 --> Bus
Bus --> Mem[Shared Memory]
end
Advantages:
- Easy to program
- Simple data sharing
- Single address space
Disadvantages:
- Limited scalability
- Memory contention
- Cache coherence complexity
Distributed Memory
Each processor has its own local memory.
graph TB
subgraph "Distributed Memory"
subgraph "Node 1"
P1[Processor 1]
M1[Memory 1]
end
subgraph "Node 2"
P2[Processor 2]
M2[Memory 2]
end
subgraph "Node 3"
P3[Processor 3]
M3[Memory 3]
end
P1 <--> Network[Network]
P2 <--> Network
P3 <--> Network
end
Advantages:
- Good scalability
- No memory contention
- Cost-effective
Disadvantages:
- Message-passing overhead
- Harder to program
- Complex data distribution
UMA (Uniform Memory Access)
Every processor sees the same memory access time.
graph TB
subgraph "UMA Architecture"
P1[CPU 1] --> Bus[Shared Bus]
P2[CPU 2] --> Bus
P3[CPU 3] --> Bus
P4[CPU 4] --> Bus
Bus --> MC[Memory Controller]
MC --> RAM[RAM]
end
Note[All CPUs: Same latency to RAM]
Characteristics:
- Simple architecture
- Predictable performance
- Cache coherence protocols
Limits:
- Bus bandwidth bottleneck
- Hard to scale beyond 4-8 processors
- Memory wall
SMP (Symmetric Multi-Processing)
The most common form of UMA.
graph TB
subgraph "SMP System"
subgraph "CPU 1"
Core1[Core]
L1_1[L1]
L2_1[L2]
end
subgraph "CPU 2"
Core2[Core]
L1_2[L1]
L2_2[L2]
end
subgraph "CPU 3"
Core3[Core]
L1_3[L1]
L2_3[L2]
end
subgraph "CPU 4"
Core4[Core]
L1_4[L1]
L2_4[L2]
end
L2_1 --> L3[Shared L3 Cache]
L2_2 --> L3
L2_3 --> L3
L2_4 --> L3
L3 --> MC[Memory Controller]
MC --> DDR[DDR Memory]
end
Example: a typical desktop/server CPU (Intel Core, AMD Ryzen)
NUMA (Non-Uniform Memory Access)
Memory access time depends on which processor makes the access.
graph TB
subgraph "NUMA Architecture"
subgraph "Node 0"
P0[Processor 0]
M0[Local Memory 0]
P0 <-->|Fast| M0
end
subgraph "Node 1"
P1[Processor 1]
M1[Local Memory 1]
P1 <-->|Fast| M1
end
subgraph "Node 2"
P2[Processor 2]
M2[Local Memory 2]
P2 <-->|Fast| M2
end
P0 <-->|Slow| IC[Interconnect]
P1 <-->|Slow| IC
P2 <-->|Slow| IC
IC --> M1
IC --> M2
IC --> M0
end
Local vs Remote Access
Local Memory Access: 50-100 ns
Remote Memory Access: 150-300 ns (2-3x slower!)
NUMA Ratio:
NUMA Ratio = Remote Access Time / Local Access Time
Typical: 1.5 - 3.0
NUMA Nodes
# Linux: Check NUMA topology
numactl --hardware
# Output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 2 4 6 8 10 12 14
# node 0 size: 65536 MB
# node 1 cpus: 1 3 5 7 9 11 13 15
# node 1 size: 65536 MB
# node distances:
# node 0 1
# 0: 10 21
# 1: 21 10
Distance 10: local access. Distance 21: remote access (2.1x slower).
NUMA Optimization
1. Memory Affinity
// Allocate memory on a specific NUMA node (libnuma; link with -lnuma)
#include <numa.h>
void* ptr = numa_alloc_onnode(size, node_id);
// Release with numa_free(ptr, size) when done
2. CPU Affinity
// Bind a thread to a specific CPU (glibc; requires _GNU_SOURCE)
#define _GNU_SOURCE
#include <pthread.h>
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(cpu_id, &cpuset);
pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
3. First-Touch Policy
// Pages are placed on the NUMA node of the thread that first touches them
#pragma omp parallel for
for (int i = 0; i < N; i++) {
array[i] = 0; // first touch happens on the local node
}
Cache Coherence
The cache consistency problem in multiprocessor systems.
Problem
sequenceDiagram
participant CPU1
participant Cache1
participant Memory
participant Cache2
participant CPU2
CPU1->>Memory: Read X (0)
Memory->>Cache1: X = 0
CPU2->>Memory: Read X (0)
Memory->>Cache2: X = 0
CPU1->>Cache1: Write X = 1
Note over Cache1: X = 1 (modified)
Note over Cache2: X = 0 (stale!)
Note over Memory: X = 0 (out of date)
Snooping Protocol
Used in bus-based systems.
graph TB
C1[Cache 1] --> Bus[Shared Bus]
C2[Cache 2] --> Bus
C3[Cache 3] --> Bus
C4[Cache 4] --> Bus
Bus --> Mem[Memory]
Note[All caches snoop bus transactions]
Steps:
- A CPU performs a write
- The write is broadcast on the bus
- The other caches snoop the transaction
- If necessary, they invalidate or update their copies
Directory-Based Protocol
Used in scalable systems (NUMA).
graph TB
subgraph "Node 0"
C0[Cache 0]
D0[Directory 0]
M0[Memory 0]
end
subgraph "Node 1"
C1[Cache 1]
D1[Directory 1]
M1[Memory 1]
end
subgraph "Node 2"
C2[Cache 2]
D2[Directory 2]
M2[Memory 2]
end
C0 <--> IC[Interconnect]
C1 <--> IC
C2 <--> IC
D0 --> M0
D1 --> M1
D2 --> M2
Directory: tracks which caches hold which data.
MESI Protocol (Snooping)
stateDiagram-v2
[*] --> I
I: Invalid
E: Exclusive
S: Shared
M: Modified
I --> E: Read (no other cache)
I --> S: Read (other caches have)
E --> M: Write
E --> S: Other cache reads
S --> M: Write
S --> I: Other cache writes
M --> S: Other cache reads
M --> I: Other cache writes
E --> I: Other cache writes (invalidate)
MOESI Protocol (AMD)
MESI + Owned state.
O (Owned):
- This cache holds the dirty copy
- Other caches may hold read-only copies
- This cache is responsible for the write-back
Advantage: cache-to-cache transfer (no write to memory)
Interconnection Networks
1. Bus
Simple, but does not scale.
graph LR
P1[Proc 1] --> Bus[Shared Bus]
P2[Proc 2] --> Bus
P3[Proc 3] --> Bus
P4[Proc 4] --> Bus
Bus --> Mem[Memory]
Bandwidth: shared by all processors. Limit: 4-8 processors.
2. Crossbar Switch
Every processor can reach every memory module.
graph TB
P1[P1] --> X[Crossbar Switch]
P2[P2] --> X
P3[P3] --> X
P4[P4] --> X
X --> M1[M1]
X --> M2[M2]
X --> M3[M3]
X --> M4[M4]
Bandwidth: high (parallel paths). Cost: O(N²), expensive.
3. Mesh Network
2D mesh topology.
graph TB
P00[P0,0] --- P01[P0,1] --- P02[P0,2]
P00 --- P10[P1,0]
P01 --- P11[P1,1]
P02 --- P12[P1,2]
P10 --- P11 --- P12
P10 --- P20[P2,0]
P11 --- P21[P2,1]
P12 --- P22[P2,2]
P20 --- P21 --- P22
Advantage: scalable, simple. Disadvantage: variable latency (corner vs. center nodes).
4. Torus
Mesh + wrap-around connections.
Advantage: shorter average distance.
5. Hypercube
N-dimensional cube.
Dimensions:
- 1D: 2 nodes
- 2D: 4 nodes (square)
- 3D: 8 nodes (cube)
- nD: 2ⁿ nodes
Diameter: log₂(N)
6. Fat Tree
Switch hierarchy.
graph TB
Root[Root Switches]
Root --> L1_1[Level 1 Switch]
Root --> L1_2[Level 1 Switch]
Root --> L1_3[Level 1 Switch]
L1_1 --> L2_1[Level 2 Switch]
L1_1 --> L2_2[Level 2 Switch]
L1_2 --> L2_3[Level 2 Switch]
L1_2 --> L2_4[Level 2 Switch]
L2_1 --> P1[Proc 1]
L2_1 --> P2[Proc 2]
L2_2 --> P3[Proc 3]
L2_2 --> P4[Proc 4]
Advantage: high bisection bandwidth. Example: data center networks.
Scalability Challenges
1. Memory Bandwidth Wall
Single Core: 50 GB/s
4 Cores: 100 GB/s
8 Cores: 120 GB/s (saturated!)
16 Cores: 120 GB/s (no gain)
Solution: NUMA, multiple memory controllers
2. Cache Coherence Traffic
graph LR
Cores[Number of Cores] --> Traffic[Coherence Traffic]
2 --> T2[Minimal]
4 --> T4[Low]
8 --> T8[Medium]
16 --> T16[High]
32 --> T32[Very High]
64 --> T64[Critical!]
Overhead: O(N²) in the worst case
3. Synchronization Bottleneck
// Global lock - serialization point
pthread_mutex_lock(&global_lock);
shared_counter++;
pthread_mutex_unlock(&global_lock);
Solution: lock-free algorithms, per-thread data
4. Load Imbalance
gantt
title Load Imbalance
dateFormat X
axisFormat %s
section Thread 1
Work :0, 10
Idle :10, 12
section Thread 2
Work :0, 12
section Thread 3
Work :0, 8
Idle :8, 12
section Thread 4
Work :0, 6
Idle :6, 12
Solution: dynamic work distribution
Real-World Multiprocessor Systems
Intel Xeon Scalable
Architecture: NUMA
Sockets: 2-8
Cores per socket: 8-64
Interconnect: UPI (Ultra Path Interconnect)
graph TB
subgraph "Socket 0"
C0[Cores 0-31]
M0[Memory 0]
end
subgraph "Socket 1"
C1[Cores 32-63]
M1[Memory 1]
end
C0 <-->|UPI| C1
C0 <--> M0
C1 <--> M1
C0 -.->|Remote| M1
C1 -.->|Remote| M0
AMD EPYC
Architecture: NUMA (chiplet-based)
Chiplets: 8-12 per socket
Cores: 64-96 per socket
Interconnect: Infinity Fabric
graph TB
subgraph "EPYC Socket"
IOD[I/O Die]
IOD --> CCD1[Chiplet 1<br/>8 cores]
IOD --> CCD2[Chiplet 2<br/>8 cores]
IOD --> CCD3[Chiplet 3<br/>8 cores]
IOD --> CCD4[Chiplet 4<br/>8 cores]
IOD --> CCD5[Chiplet 5<br/>8 cores]
IOD --> CCD6[Chiplet 6<br/>8 cores]
end
ARM Server
Neoverse: cloud/HPC
Cores: 32-128 per chip
Interconnect: mesh
IBM POWER
SMT-8: 8 threads per core
Cores: 12-24 per chip
Strong RAS features
ccNUMA (Cache Coherent NUMA)
Hardware cache coherence combined with NUMA.
graph TB
subgraph "Node 0"
P0[Processor 0]
C0[Coherent Cache]
M0[Memory 0]
P0 --> C0 --> M0
end
subgraph "Node 1"
P1[Processor 1]
C1[Coherent Cache]
M1[Memory 1]
P1 --> C1 --> M1
end
C0 <-->|Cache Coherence| Interconnect[Coherent Interconnect]
C1 <-->|Cache Coherence| Interconnect
Interconnect --> M0
Interconnect --> M1
Example: modern server systems
Performance Tuning
1. NUMA Awareness
# Run on specific node
numactl --cpunodebind=0 --membind=0 ./program
# Interleave memory
numactl --interleave=all ./program
2. Profiling
# NUMA statistics
numastat
# Per-process NUMA stats
numastat -p PID
3. Minimize Remote Access
// BAD: Random access across nodes
for (int i = 0; i < N; i++) {
process(array[random() % N]);
}
// GOOD: Local access pattern
int start = thread_id * (N / num_threads);
int end = start + (N / num_threads);
for (int i = start; i < end; i++) {
process(array[i]);
}
4. Reduce Coherence Traffic
// BAD: False sharing
struct {
int counter1; // Thread 1
int counter2; // Thread 2
} counters;
// GOOD: Padding
struct {
int counter1;
char pad[64];
int counter2;
} counters;
Best Practices
- NUMA-aware programming
  - Allocate memory on the local node
  - Minimize remote access
- Reduce synchronization
  - Per-thread data structures
  - Lock-free algorithms
- Balance load
  - Dynamic scheduling
  - Work stealing
- Profile performance
  - Identify NUMA issues
  - Cache coherence overhead
- Consider the architecture
  - UMA vs NUMA
  - Memory bandwidth limits
Related Topics
- Cache Memory: Cache coherence protocols
- Memory Ordering: Multi-processor consistency
- Synchronization: Lock-free algorithms
- Parallelism: Thread-level parallelism
- Performance: Scalability limits