
Types of Parallelism

What Is Parallelism?

Parallelism is the execution of multiple operations at the same time. It is the primary technique for increasing performance.

graph TB
P[Parallelism] --> ILP[Instruction-Level<br/>Parallelism]
P --> TLP[Thread-Level<br/>Parallelism]
P --> DLP[Data-Level<br/>Parallelism]
P --> TLP2[Task-Level<br/>Parallelism]

Flynn's Taxonomy

A classification of parallel systems.

graph TB
F[Flynn's Taxonomy]

F --> SISD[SISD<br/>Single Instruction<br/>Single Data]
F --> SIMD[SIMD<br/>Single Instruction<br/>Multiple Data]
F --> MISD[MISD<br/>Multiple Instruction<br/>Single Data]
F --> MIMD[MIMD<br/>Multiple Instruction<br/>Multiple Data]

SISD --> S1[Scalar processor]
SIMD --> S2[Vector processor<br/>GPU]
MISD --> S3[Rare<br/>Fault-tolerant systems]
MIMD --> S4[Multi-core CPU<br/>Distributed systems]

1. Instruction-Level Parallelism (ILP)

Parallel execution of instructions within a single thread.

Pipeline Parallelism

gantt
title 5-Stage Pipeline
dateFormat X
axisFormat %s

section Inst 1
Fetch :0, 1
Decode :1, 2
Execute :2, 3
Memory :3, 4
Write :4, 5

section Inst 2
Fetch :1, 2
Decode :2, 3
Execute :3, 4
Memory :4, 5
Write :5, 6

section Inst 3
Fetch :2, 3
Decode :3, 4
Execute :4, 5
Memory :5, 6
Write :6, 7

Ideal IPC: 1 instruction / cycle

Superscalar Execution

Executes multiple instructions in a single cycle.

graph TB
Fetch[Fetch Unit] --> Q[Instruction Queue]

Q --> D1[Dispatch]
D1 --> ALU1[ALU 1]
D1 --> ALU2[ALU 2]
D1 --> FPU[FPU]
D1 --> LSU[Load/Store Unit]

ALU1 --> ROB[Reorder Buffer]
ALU2 --> ROB
FPU --> ROB
LSU --> ROB

ROB --> Commit[Commit]

Modern CPUs: 4-8 instructions / cycle

Out-of-Order Execution

Instructions execute as soon as their operands are ready, not in program order.

// Program order
a = b + c; // Inst 1 (b misses in cache!)
d = e + f; // Inst 2 (independent)
g = a + d; // Inst 3 (depends on Inst 1 & 2)

sequenceDiagram
participant Fetch
participant Issue
participant Execute
participant Commit

Fetch->>Issue: Inst 1, 2, 3
Issue->>Execute: Inst 2 (ready!)
Note over Execute: Execute Inst 2 first
Issue->>Execute: Inst 1 (b loaded)
Execute->>Execute: Inst 3 (after 1 & 2)
Execute->>Commit: Commit in order

Advantage: fewer stalls on cache misses and data dependencies

ILP Limits

Data dependencies limit ILP: a serial dependency chain caps the achievable speedup, much as the serial fraction does in Amdahl's Law.

// Low ILP - tight dependency
int sum = 0;
for (int i = 0; i < N; i++) {
    sum += array[i]; // Dependency chain
}

// Higher ILP - independent operations
int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
for (int i = 0; i < N; i += 4) {
    sum1 += array[i];
    sum2 += array[i+1];
    sum3 += array[i+2];
    sum4 += array[i+3];
}

Typical ILP: 2-4 (real workloads)

2. Thread-Level Parallelism (TLP)

Parallel execution of multiple threads.

Temporal Multithreading

Context switching between threads.

gantt
title Temporal Multithreading
dateFormat X
axisFormat %s

section Thread 1
Run :0, 2
Blocked :2, 4
Run :4, 6

section Thread 2
Idle :0, 2
Run :2, 4
Idle :4, 6

Coarse-grained: switch on a cache miss
Fine-grained: switch every cycle

Simultaneous Multithreading (SMT)

Multiple threads run simultaneously on the same core.

graph TB
subgraph "SMT Core (Hyper-Threading)"
Fetch[Shared Fetch]

Fetch --> T1[Thread 1<br/>Context]
Fetch --> T2[Thread 2<br/>Context]

T1 --> EU[Shared<br/>Execution Units]
T2 --> EU

EU --> ALU1[ALU 1]
EU --> ALU2[ALU 2]
EU --> FPU[FPU]
EU --> LSU[Load/Store]
end

Intel Hyper-Threading: 2 threads / core
IBM POWER: 4-8 threads / core

Example:

Single-threaded: 40% utilization
2-thread SMT: 70% utilization (1.75x faster)

Multi-Core

Physically separate cores.

graph TB
subgraph "Multi-Core CPU"
C1[Core 1<br/>Thread 1, 2]
C2[Core 2<br/>Thread 3, 4]
C3[Core 3<br/>Thread 5, 6]
C4[Core 4<br/>Thread 7, 8]

C1 --> L3[Shared L3 Cache]
C2 --> L3
C3 --> L3
C4 --> L3

L3 --> Mem[Main Memory]
end

Advantage: true parallelism
Drawback: extra chip area and power

TLP Scaling

graph LR
S[Single Core<br/>1x] --> SMT[SMT 2-way<br/>1.3x]
SMT --> D[Dual Core<br/>1.8x]
D --> Q[Quad Core<br/>3.2x]
Q --> O[Octa Core<br/>5.5x]

Scaling is never perfect:

  • Synchronization overhead
  • Memory bandwidth limit
  • Cache contention
  • Amdahl's Law

3. Data-Level Parallelism (DLP)

The same operation is applied to different data.

SIMD (Single Instruction Multiple Data)

One instruction operates on multiple data elements at once.

graph LR
subgraph "Scalar"
A1[a] --> Add1[+]
B1[b] --> Add1
Add1 --> C1[c]
end

subgraph "SIMD (4-wide)"
A[a0 a1 a2 a3] --> Add[+ + + +]
B[b0 b1 b2 b3] --> Add
Add --> C[c0 c1 c2 c3]
end

Vector Processing

x86 SIMD Extensions:

// Scalar: 1 operation
float c = a + b;

// SSE: 4 floats
__m128 va = _mm_load_ps(a); // Load 4 floats
__m128 vb = _mm_load_ps(b);
__m128 vc = _mm_add_ps(va, vb); // Add 4 floats
_mm_store_ps(c, vc);

// AVX: 8 floats
__m256 va = _mm256_load_ps(a);
__m256 vb = _mm256_load_ps(b);
__m256 vc = _mm256_add_ps(va, vb);
_mm256_store_ps(c, vc);

// AVX-512: 16 floats (one instruction!)

SIMD Width Evolution

Extension | Width   | Elements (float)  | Year
MMX       | 64-bit  | 2 (integer only)  | 1997
SSE       | 128-bit | 4                 | 1999
AVX       | 256-bit | 8                 | 2011
AVX-512   | 512-bit | 16                | 2016

ARM NEON

// ARM NEON intrinsics
float32x4_t va = vld1q_f32(a); // Load 4 floats
float32x4_t vb = vld1q_f32(b);
float32x4_t vc = vaddq_f32(va, vb); // Add
vst1q_f32(c, vc); // Store

Auto-Vectorization

The compiler applies SIMD automatically.

// Simple loop
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}

// Compiler generates (with -O3 -march=native):
// vectorized loop using AVX

Alignment matters:

// Aligned allocation
float* a = (float*)aligned_alloc(32, N * sizeof(float));

GPU Processing (SIMT)

SIMT: Single Instruction Multiple Thread

graph TB
subgraph "GPU Architecture"
SM1[Streaming<br/>Multiprocessor 1]
SM2[Streaming<br/>Multiprocessor 2]
SM3[Streaming<br/>Multiprocessor N]

SM1 --> Core1[32-64 Cores]
SM2 --> Core2[32-64 Cores]
SM3 --> Core3[32-64 Cores]

Core1 --> Mem[Global Memory]
Core2 --> Mem
Core3 --> Mem
end

CUDA / OpenCL Example:

__global__ void vectorAdd(float* a, float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i]; // Thousands of threads!
    }
}

4. Task-Level Parallelism

Different tasks execute in parallel.

graph TB
Program[Program] --> T1[Task 1<br/>Image Processing]
Program --> T2[Task 2<br/>Audio Decoding]
Program --> T3[Task 3<br/>Network I/O]
Program --> T4[Task 4<br/>UI Updates]

T1 --> C1[Core 1]
T2 --> C2[Core 2]
T3 --> C3[Core 3]
T4 --> C4[Core 4]

Fork-Join Pattern

#pragma omp parallel for
for (int i = 0; i < N; i++) {
    process(array[i]);
}

sequenceDiagram
participant Main
participant T1 as Thread 1
participant T2 as Thread 2
participant T3 as Thread 3
participant T4 as Thread 4

Main->>T1: Fork
Main->>T2: Fork
Main->>T3: Fork
Main->>T4: Fork

T1->>T1: Work
T2->>T2: Work
T3->>T3: Work
T4->>T4: Work

T1->>Main: Join
T2->>Main: Join
T3->>Main: Join
T4->>Main: Join

Pipeline Pattern

graph LR
I[Input] --> S1[Stage 1<br/>Thread 1]
S1 --> S2[Stage 2<br/>Thread 2]
S2 --> S3[Stage 3<br/>Thread 3]
S3 --> O[Output]

Example: video encoding

  • Stage 1: Read frame
  • Stage 2: Encode
  • Stage 3: Write output

Hyper-Threading Details

Intel Hyper-Threading (SMT)

Replicated Resources

Separate per thread:

  • Program Counter
  • Registers
  • Flags

Shared Resources

Shared between threads:

  • Execution units (ALU, FPU)
  • Caches (L1, L2)
  • TLB
graph TB
subgraph "HT-Enabled Core"
subgraph "Thread 1"
PC1[PC]
Reg1[Registers]
end

subgraph "Thread 2"
PC2[PC]
Reg2[Registers]
end

subgraph "Shared"
ALU[ALUs]
FPU[FPU]
Cache[L1/L2 Cache]
end

PC1 --> ALU
PC2 --> ALU
Reg1 --> FPU
Reg2 --> FPU
end

Performance Impact

Workload A (compute-heavy):
Single-thread: 100%
HT (2 threads): 130%

Workload B (memory-heavy):
Single-thread: 100%
HT (2 threads): 170%

While one thread waits on a cache miss, the other thread keeps the execution units busy!

Parallelism Challenges

1. Synchronization Overhead

// High overhead - lock/unlock on every iteration
for (int i = 0; i < N; i++) {
    pthread_mutex_lock(&m);
    shared_counter++;
    pthread_mutex_unlock(&m);
}

// Low overhead - accumulate locally, lock once
int local = 0;
for (int i = 0; i < N; i++) {
    local++;
}
pthread_mutex_lock(&m);
shared_counter += local;
pthread_mutex_unlock(&m);

2. Load Balancing

graph TB
subgraph "Poor Balance"
T1[Thread 1<br/>80% work]
T2[Thread 2<br/>10% work]
T3[Thread 3<br/>10% work]
end

subgraph "Good Balance"
T4[Thread 1<br/>33% work]
T5[Thread 2<br/>33% work]
T6[Thread 3<br/>34% work]
end

3. False Sharing

struct Counter {
    int count1; // written by Thread 1
    int count2; // written by Thread 2
}; // both fields share one cache line!

// Solution: padding puts count2 on its own 64-byte cache line
struct Counter {
    int count1;
    char pad[60]; // fills the rest of count1's cache line
    int count2;
};

4. Memory Bandwidth

Single core:  20 GB/s
4 cores: 40 GB/s (not 80!)
8 cores: 60 GB/s

Memory bandwidth limit!

Parallelism Patterns

Embarrassingly Parallel

No communication between tasks.

// Perfect parallelism
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    output[i] = expensive_computation(input[i]);
}

Scalability: Near-linear

Reduction

Combine partial results into one value.

int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
    sum += array[i];
}

Producer-Consumer

Queue<Task> queue;

// Producer thread
while (has_work) {
    queue.push(create_task());
}

// Consumer threads
while (true) {
    Task t = queue.pop();
    process(t);
}

Performance Metrics

Speedup

Speedup = T(1) / T(N)

where T(N) is the execution time with N processors

Efficiency

Efficiency = Speedup / N

Ideal: 1.0 (100%)
Typical: 0.7-0.9 (70-90%)

Scalability

graph LR
N[Cores] --> S[Speedup]

1 --> 1x[1.0x]
2 --> 2x[1.8x]
4 --> 4x[3.2x]
8 --> 8x[5.5x]
16 --> 16x[9.0x]

Heterogeneous Computing

Different types of processors working together.

graph TB
App[Application]

App --> CPU[CPU<br/>Serial + Control]
App --> GPU[GPU<br/>Parallel Compute]
App --> FPGA[FPGA<br/>Custom Logic]
App --> DSP[DSP<br/>Signal Processing]

CPU --> Mem[Unified Memory]
GPU --> Mem
FPGA --> Mem
DSP --> Mem

Example: Apple M1/M2

  • CPU cores (performance + efficiency)
  • GPU cores
  • Neural Engine
  • Video encode/decode

Best Practices

  1. Profile first

    • Find the parallelizable portion
    • Identify the bottlenecks
  2. Coarse-grained parallelism

    • Large tasks
    • Less synchronization
  3. Minimize communication

    • Local data
    • Batch operations
  4. Use libraries

    • OpenMP, TBB, PPL
    • BLAS, cuBLAS
  5. Test scaling

    • 1, 2, 4, 8... cores
    • Measure speedup

Related Topics

  • CPU Architecture: Superscalar və out-of-order
  • Multiprocessor Systems: Multi-core architecture
  • Synchronization: Lock-free algorithms
  • Performance: Amdahl's Law and scaling
  • Cache Memory: False sharing