Multiprocessor Systems
What Is a Multiprocessor?
A multiprocessor is a system in which several processors operate within the same machine, increasing performance and throughput.
graph TB
MP[Multiprocessor Systems]
MP --> SM[Shared Memory]
MP --> DM[Distributed Memory]
SM --> UMA[UMA<br/>Uniform Memory Access]
SM --> NUMA[NUMA<br/>Non-Uniform Memory Access]
DM --> Cluster[Cluster]
DM --> Grid[Grid Computing]
Shared Memory vs Distributed Memory
Shared Memory
All processors access the same memory space.
graph TB
subgraph "Shared Memory"
P1[Processor 1]
P2[Processor 2]
P3[Processor 3]
P4[Processor 4]
P1 --> Bus[Interconnect]
P2 --> Bus
P3 --> Bus
P4 --> Bus
Bus --> Mem[Shared Memory]
end
Advantages:
- Easy to program
- Simple data sharing
- Single address space
Disadvantages:
- Limited scalability
- Memory contention
- Cache coherence complexity
Distributed Memory
Each processor has its own local memory.
graph TB
subgraph "Distributed Memory"
subgraph "Node 1"
P1[Processor 1]
M1[Memory 1]
end
subgraph "Node 2"
P2[Processor 2]
M2[Memory 2]
end
subgraph "Node 3"
P3[Processor 3]
M3[Memory 3]
end
P1 <--> Network[Network]
P2 <--> Network
P3 <--> Network
end
Advantages:
- Good scalability
- No memory contention
- Cost-effective
Disadvantages:
- Message-passing overhead
- Harder to program
- Complex data distribution
UMA (Uniform Memory Access)
Every processor sees the same memory access time.
graph TB
subgraph "UMA Architecture"
P1[CPU 1] --> Bus[Shared Bus]
P2[CPU 2] --> Bus
P3[CPU 3] --> Bus
P4[CPU 4] --> Bus
Bus --> MC[Memory Controller]
MC --> RAM[RAM]
end
Note[All CPUs: Same latency to RAM]
Characteristics:
- Simple architecture
- Predictable performance
- Cache coherence protocols
Limits:
- Bus bandwidth bottleneck
- Hard to scale beyond 4-8 processors
- Memory wall
SMP (Symmetric Multi-Processing)
The most common form of UMA.
graph TB
subgraph "SMP System"
subgraph "CPU 1"
Core1[Core]
L1_1[L1]
L2_1[L2]
end
subgraph "CPU 2"
Core2[Core]
L1_2[L1]
L2_2[L2]
end
subgraph "CPU 3"
Core3[Core]
L1_3[L1]
L2_3[L2]
end
subgraph "CPU 4"
Core4[Core]
L1_4[L1]
L2_4[L2]
end
L2_1 --> L3[Shared L3 Cache]
L2_2 --> L3
L2_3 --> L3
L2_4 --> L3
L3 --> MC[Memory Controller]
MC --> DDR[DDR Memory]
end
Example: a typical desktop/server CPU (Intel Core, AMD Ryzen)
NUMA (Non-Uniform Memory Access)
Memory access time depends on which processor makes the access.
graph TB
subgraph "NUMA Architecture"
subgraph "Node 0"
P0[Processor 0]
M0[Local Memory 0]
P0 <-->|Fast| M0
end
subgraph "Node 1"
P1[Processor 1]
M1[Local Memory 1]
P1 <-->|Fast| M1
end
subgraph "Node 2"
P2[Processor 2]
M2[Local Memory 2]
P2 <-->|Fast| M2
end
P0 <-->|Slow| IC[Interconnect]
P1 <-->|Slow| IC
P2 <-->|Slow| IC
IC --> M1
IC --> M2
IC --> M0
end
Local vs Remote Access
Local Memory Access: 50-100 ns
Remote Memory Access: 150-300 ns (2-3x slower!)
NUMA Ratio:
NUMA Ratio = Remote Access Time / Local Access Time
Typical: 1.5 - 3.0
NUMA Nodes
# Linux: Check NUMA topology
numactl --hardware
# Output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 2 4 6 8 10 12 14
# node 0 size: 65536 MB
# node 1 cpus: 1 3 5 7 9 11 13 15
# node 1 size: 65536 MB
# node distances:
# node 0 1
# 0: 10 21
# 1: 21 10
Distance 10: local access. Distance 21: remote access (2.1x slower).
NUMA Optimization
1. Memory Affinity
// Allocate memory on a specific NUMA node (libnuma; link with -lnuma)
#include <numa.h>
void* ptr = numa_alloc_onnode(size, node_id);
// Release with numa_free(ptr, size) when done
2. CPU Affinity
// Bind a thread to a specific CPU (glibc; requires _GNU_SOURCE)
#define _GNU_SOURCE
#include <pthread.h>
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(cpu_id, &cpuset);
pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
3. First-Touch Policy
// Pages are placed on the NUMA node of the thread that first touches them
#pragma omp parallel for
for (int i = 0; i < N; i++) {
array[i] = 0; // first touch happens on the local node
}
Cache Coherence
The cache consistency problem in multiprocessor systems.
Problem
sequenceDiagram
participant CPU1
participant Cache1
participant Memory
participant Cache2
participant CPU2
CPU1->>Memory: Read X (0)
Memory->>Cache1: X = 0
CPU2->>Memory: Read X (0)
Memory->>Cache2: X = 0
CPU1->>Cache1: Write X = 1
Note over Cache1: X = 1 (modified)
Note over Cache2: X = 0 (stale!)
Note over Memory: X = 0 (out of date)
Snooping Protocol
Used in bus-based systems.
graph TB
C1[Cache 1] --> Bus[Shared Bus]
C2[Cache 2] --> Bus
C3[Cache 3] --> Bus
C4[Cache 4] --> Bus
Bus --> Mem[Memory]
Note[All caches snoop bus transactions]
Steps:
- A CPU performs a write
- The write is broadcast on the bus
- The other caches snoop the transaction
- If necessary, they invalidate or update their copies
Directory-Based Protocol
Used in scalable systems (NUMA).
graph TB
subgraph "Node 0"
C0[Cache 0]
D0[Directory 0]
M0[Memory 0]
end
subgraph "Node 1"
C1[Cache 1]
D1[Directory 1]
M1[Memory 1]
end
subgraph "Node 2"
C2[Cache 2]
D2[Directory 2]
M2[Memory 2]
end
C0 <--> IC[Interconnect]
C1 <--> IC
C2 <--> IC
D0 --> M0
D1 --> M1
D2 --> M2
Directory: tracks which caches hold which data.
MESI Protocol (Snooping)
stateDiagram-v2
[*] --> I
I: Invalid
E: Exclusive
S: Shared
M: Modified
I --> E: Read (no other cache)
I --> S: Read (other caches have)
E --> M: Write
E --> S: Other cache reads
S --> M: Write
S --> I: Other cache writes
M --> S: Other cache reads
M --> I: Other cache writes
E --> I: Other cache writes (invalidate)
MOESI Protocol (AMD)
MESI + Owned state.
O (Owned):
- This cache holds the dirty copy
- Other caches may hold read-only copies
- This cache is responsible for the write-back
Advantage: cache-to-cache transfer (no write to memory)
Interconnection Networks
1. Bus
Simple, but does not scale.
graph LR
P1[Proc 1] --> Bus[Shared Bus]
P2[Proc 2] --> Bus
P3[Proc 3] --> Bus
P4[Proc 4] --> Bus
Bus --> Mem[Memory]
Bandwidth: shared by all processors. Limit: 4-8 processors.
2. Crossbar Switch
Every processor can reach every memory module.
graph TB
P1[P1] --> X[Crossbar Switch]
P2[P2] --> X
P3[P3] --> X
P4[P4] --> X
X --> M1[M1]
X --> M2[M2]
X --> M3[M3]
X --> M4[M4]
Bandwidth: high (parallel paths). Cost: O(N²), expensive.
3. Mesh Network
2D mesh topology.
graph TB
P00[P0,0] --- P01[P0,1] --- P02[P0,2]
P00 --- P10[P1,0]
P01 --- P11[P1,1]
P02 --- P12[P1,2]
P10 --- P11 --- P12
P10 --- P20[P2,0]
P11 --- P21[P2,1]
P12 --- P22[P2,2]
P20 --- P21 --- P22
Advantage: scalable, simple. Disadvantage: variable latency (corner vs. center nodes).
4. Torus
Mesh + wrap-around connections.
Advantage: shorter average distance.
5. Hypercube
N-dimensional cube.
Dimensions:
- 1D: 2 nodes
- 2D: 4 nodes (square)
- 3D: 8 nodes (cube)
- nD: 2ⁿ nodes
Diameter: log₂(N)
6. Fat Tree
Switch hierarchy.
graph TB
Root[Root Switches]
Root --> L1_1[Level 1 Switch]
Root --> L1_2[Level 1 Switch]
Root --> L1_3[Level 1 Switch]
L1_1 --> L2_1[Level 2 Switch]
L1_1 --> L2_2[Level 2 Switch]
L1_2 --> L2_3[Level 2 Switch]
L1_2 --> L2_4[Level 2 Switch]
L2_1 --> P1[Proc 1]
L2_1 --> P2[Proc 2]
L2_2 --> P3[Proc 3]
L2_2 --> P4[Proc 4]
Advantage: high bisection bandwidth. Example: data center networks.
Scalability Challenges
1. Memory Bandwidth Wall
Single Core: 50 GB/s
4 Cores: 100 GB/s
8 Cores: 120 GB/s (saturated!)
16 Cores: 120 GB/s (no gain)
Solution: NUMA, multiple memory controllers
2. Cache Coherence Traffic
graph LR
Cores[Number of Cores] --> Traffic[Coherence Traffic]
2 --> T2[Minimal]
4 --> T4[Low]
8 --> T8[Medium]
16 --> T16[High]
32 --> T32[Very High]
64 --> T64[Critical!]
Overhead: O(N²) in the worst case
3. Synchronization Bottleneck
// Global lock - serialization point
pthread_mutex_lock(&global_lock);
shared_counter++;
pthread_mutex_unlock(&global_lock);
Solution: lock-free algorithms, per-thread data
4. Load Imbalance
gantt
title Load Imbalance
dateFormat X
axisFormat %s
section Thread 1
Work :0, 10
Idle :10, 12
section Thread 2
Work :0, 12
section Thread 3
Work :0, 8
Idle :8, 12
section Thread 4
Work :0, 6
Idle :6, 12
Solution: dynamic work distribution
Real-World Multiprocessor Systems
Intel Xeon Scalable
Architecture: NUMA
Sockets: 2-8
Cores per socket: 8-64
Interconnect: UPI (Ultra Path Interconnect)
graph TB
subgraph "Socket 0"
C0[Cores 0-31]
M0[Memory 0]
end
subgraph "Socket 1"
C1[Cores 32-63]
M1[Memory 1]
end
C0 <-->|UPI| C1
C0 <--> M0
C1 <--> M1
C0 -.->|Remote| M1
C1 -.->|Remote| M0
AMD EPYC
Architecture: NUMA (chiplet-based)
Chiplets: 8-12 per socket
Cores: 64-96 per socket
Interconnect: Infinity Fabric
graph TB
subgraph "EPYC Socket"
IOD[I/O Die]
IOD --> CCD1[Chiplet 1<br/>8 cores]
IOD --> CCD2[Chiplet 2<br/>8 cores]
IOD --> CCD3[Chiplet 3<br/>8 cores]
IOD --> CCD4[Chiplet 4<br/>8 cores]
IOD --> CCD5[Chiplet 5<br/>8 cores]
IOD --> CCD6[Chiplet 6<br/>8 cores]
end
ARM Server
Neoverse: cloud/HPC
Cores: 32-128 per chip
Interconnect: mesh
IBM POWER
SMT-8: 8 threads per core
Cores: 12-24 per chip
Strong RAS features
ccNUMA (Cache Coherent NUMA)
Hardware cache coherence combined with NUMA.
graph TB
subgraph "Node 0"
P0[Processor 0]
C0[Coherent Cache]
M0[Memory 0]
P0 --> C0 --> M0
end
subgraph "Node 1"
P1[Processor 1]
C1[Coherent Cache]
M1[Memory 1]
P1 --> C1 --> M1
end
C0 <-->|Cache Coherence| Interconnect[Coherent Interconnect]
C1 <-->|Cache Coherence| Interconnect
Interconnect --> M0
Interconnect --> M1
Example: modern server systems
Performance Tuning
1. NUMA Awareness
# Run on specific node
numactl --cpunodebind=0 --membind=0 ./program
# Interleave memory
numactl --interleave=all ./program
2. Profiling
# NUMA statistics
numastat
# Per-process NUMA stats
numastat -p PID
3. Minimize Remote Access
// BAD: Random access across nodes
for (int i = 0; i < N; i++) {
process(array[random() % N]);
}
// GOOD: Local access pattern
int start = thread_id * (N / num_threads);
int end = start + (N / num_threads);
for (int i = start; i < end; i++) {
process(array[i]);
}
4. Reduce Coherence Traffic
// BAD: False sharing
struct {
int counter1; // Thread 1
int counter2; // Thread 2
} counters;
// GOOD: Padding
struct {
int counter1;
char pad[64];
int counter2;
} counters;
Best Practices
- NUMA-aware programming
  - Allocate memory on the local node
  - Minimize remote access
- Reduce synchronization
  - Per-thread data structures
  - Lock-free algorithms
- Balance load
  - Dynamic scheduling
  - Work stealing
- Profile performance
  - Identify NUMA issues
  - Cache coherence overhead
- Consider the architecture
  - UMA vs NUMA
  - Memory bandwidth limits
Related Topics
- Cache Memory: Cache coherence protocols
- Memory Ordering: Multi-processor consistency
- Synchronization: Lock-free algorithms
- Parallelism: Thread-level parallelism
- Performance: Scalability limits