Types of Parallelism
What Is Parallelism?
Parallelism is the execution of several operations at the same time. It is the primary means of increasing performance.
graph TB
P[Parallelism] --> ILP[Instruction-Level<br/>Parallelism]
P --> TLP[Thread-Level<br/>Parallelism]
P --> DLP[Data-Level<br/>Parallelism]
P --> TLP2[Task-Level<br/>Parallelism]
Flynn's Taxonomy
A classification of parallel systems.
graph TB
F[Flynn's Taxonomy]
F --> SISD[SISD<br/>Single Instruction<br/>Single Data]
F --> SIMD[SIMD<br/>Single Instruction<br/>Multiple Data]
F --> MISD[MISD<br/>Multiple Instruction<br/>Single Data]
F --> MIMD[MIMD<br/>Multiple Instruction<br/>Multiple Data]
SISD --> S1[Scalar processor]
SIMD --> S2[Vector processor<br/>GPU]
MISD --> S3[Rare<br/>Fault-tolerant systems]
MIMD --> S4[Multi-core CPU<br/>Distributed systems]
1. Instruction-Level Parallelism (ILP)
Parallel execution of instructions within a single thread.
Pipeline Parallelism
gantt
title 5-Stage Pipeline
dateFormat X
axisFormat %s
section Inst 1
Fetch :0, 1
Decode :1, 2
Execute :2, 3
Memory :3, 4
Write :4, 5
section Inst 2
Fetch :1, 2
Decode :2, 3
Execute :3, 4
Memory :4, 5
Write :5, 6
section Inst 3
Fetch :2, 3
Decode :3, 4
Execute :4, 5
Memory :5, 6
Write :6, 7
Ideal IPC: 1 instruction / cycle
Superscalar Execution
Executes several instructions per cycle.
graph TB
Fetch[Fetch Unit] --> Q[Instruction Queue]
Q --> D1[Dispatch]
D1 --> ALU1[ALU 1]
D1 --> ALU2[ALU 2]
D1 --> FPU[FPU]
D1 --> LSU[Load/Store Unit]
ALU1 --> ROB[Reorder Buffer]
ALU2 --> ROB
FPU --> ROB
LSU --> ROB
ROB --> Commit[Commit]
Modern CPUs: 4-8 instructions / cycle
Out-of-Order Execution
Instructions execute as their operands become ready, not in program order.
// Program order
a = b + c; // Inst 1 (b cache miss!)
d = e + f; // Inst 2 (independent)
g = a + d; // Inst 3 (depends on 1 & 2)
sequenceDiagram
participant Fetch
participant Issue
participant Execute
participant Commit
Fetch->>Issue: Inst 1, 2, 3
Issue->>Execute: Inst 2 (ready!)
Note over Execute: Execute Inst 2 first
Issue->>Execute: Inst 1 (b loaded)
Execute->>Execute: Inst 3 (after 1 & 2)
Execute->>Commit: Commit in order
Advantage: fewer stalls on cache misses and dependencies
ILP Limitləri
Data dependencies limit ILP (Amdahl's Law applied to the serial dependency chain)
// Low ILP - tight dependency
int sum = 0;
for (int i = 0; i < N; i++) {
sum += array[i]; // Dependency chain
}
// Higher ILP - four independent dependency chains
int sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0;
for (int i = 0; i < N; i += 4) { // assumes N is a multiple of 4
sum1 += array[i];
sum2 += array[i+1];
sum3 += array[i+2];
sum4 += array[i+3];
}
int sum = sum1 + sum2 + sum3 + sum4; // combine the partial sums
Typical ILP: 2-4 (real workloads)
2. Thread-Level Parallelism (TLP)
Parallel execution of multiple threads.
Temporal Multithreading
Context switching between threads.
gantt
title Temporal Multithreading
dateFormat X
axisFormat %s
section Thread 1
Run :0, 2
Blocked :2, 4
Run :4, 6
section Thread 2
Idle :0, 2
Run :2, 4
Idle :4, 6
Coarse-grained: switch on a cache miss
Fine-grained: switch every cycle
Simultaneous Multithreading (SMT)
Multiple threads run at the same time on the same core.
graph TB
subgraph "SMT Core (Hyper-Threading)"
Fetch[Shared Fetch]
Fetch --> T1[Thread 1<br/>Context]
Fetch --> T2[Thread 2<br/>Context]
T1 --> EU[Shared<br/>Execution Units]
T2 --> EU
EU --> ALU1[ALU 1]
EU --> ALU2[ALU 2]
EU --> FPU[FPU]
EU --> LSU[Load/Store]
end
Intel Hyper-Threading: 2 threads / core
IBM POWER: 4-8 threads / core
Example:
Single-threaded: 40% utilization
2-thread SMT: 70% utilization (1.75x faster)
Multi-Core
Physically separate cores.
graph TB
subgraph "Multi-Core CPU"
C1[Core 1<br/>Thread 1, 2]
C2[Core 2<br/>Thread 3, 4]
C3[Core 3<br/>Thread 5, 6]
C4[Core 4<br/>Thread 7, 8]
C1 --> L3[Shared L3 Cache]
C2 --> L3
C3 --> L3
C4 --> L3
L3 --> Mem[Main Memory]
end
Advantage: true parallelism
Drawback: extra die area and power
TLP Scaling
graph LR
S[Single Core<br/>1x] --> SMT[SMT 2-way<br/>1.3x]
SMT --> D[Dual Core<br/>1.8x]
D --> Q[Quad Core<br/>3.2x]
Q --> O[Octa Core<br/>5.5x]
There is no perfect scaling:
- Synchronization overhead
- Memory bandwidth limit
- Cache contention
- Amdahl's Law
3. Data-Level Parallelism (DLP)
The same operation is applied to different data.
SIMD (Single Instruction Multiple Data)
One instruction operates on multiple data elements.
graph LR
subgraph "Scalar"
A1[a] --> Add1[+]
B1[b] --> Add1
Add1 --> C1[c]
end
subgraph "SIMD (4-wide)"
A[a0 a1 a2 a3] --> Add[+ + + +]
B[b0 b1 b2 b3] --> Add
Add --> C[c0 c1 c2 c3]
end
Vector Processing
x86 SIMD Extensions:
// Scalar: 1 operation
float c = a + b;
// SSE: 4 floats
__m128 va = _mm_load_ps(a); // Load 4 floats
__m128 vb = _mm_load_ps(b);
__m128 vc = _mm_add_ps(va, vb); // Add 4 floats
_mm_store_ps(c, vc);
// AVX: 8 floats
__m256 va = _mm256_load_ps(a);
__m256 vb = _mm256_load_ps(b);
__m256 vc = _mm256_add_ps(va, vb);
_mm256_store_ps(c, vc);
// AVX-512: 16 floats (one instruction!)
SIMD Width Evolution
| Extension | Width | Elements (float) | Year |
|---|---|---|---|
| MMX | 64-bit | 2 | 1997 |
| SSE | 128-bit | 4 | 1999 |
| AVX | 256-bit | 8 | 2011 |
| AVX-512 | 512-bit | 16 | 2016 |
ARM NEON
// ARM NEON intrinsics
float32x4_t va = vld1q_f32(a); // Load 4 floats
float32x4_t vb = vld1q_f32(b);
float32x4_t vc = vaddq_f32(va, vb); // Add
vst1q_f32(c, vc); // Store
Auto-Vectorization
The compiler uses SIMD automatically.
// Simple loop
for (int i = 0; i < N; i++) {
c[i] = a[i] + b[i];
}
// Compiler generates (with -O3 -march=native):
// vectorized loop using AVX
Alignment matters:
// Aligned allocation
float* a = (float*)aligned_alloc(32, N * sizeof(float));
GPU Processing (SIMT)
SIMT: Single Instruction Multiple Thread
graph TB
subgraph "GPU Architecture"
SM1[Streaming<br/>Multiprocessor 1]
SM2[Streaming<br/>Multiprocessor 2]
SM3[Streaming<br/>Multiprocessor N]
SM1 --> Core1[32-64 Cores]
SM2 --> Core2[32-64 Cores]
SM3 --> Core3[32-64 Cores]
Core1 --> Mem[Global Memory]
Core2 --> Mem
Core3 --> Mem
end
CUDA / OpenCL Example:
__global__ void vectorAdd(float* a, float* b, float* c, int n) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
c[i] = a[i] + b[i]; // Thousands of threads!
}
}
4. Task-Level Parallelism
Different tasks execute in parallel.
graph TB
Program[Program] --> T1[Task 1<br/>Image Processing]
Program --> T2[Task 2<br/>Audio Decoding]
Program --> T3[Task 3<br/>Network I/O]
Program --> T4[Task 4<br/>UI Updates]
T1 --> C1[Core 1]
T2 --> C2[Core 2]
T3 --> C3[Core 3]
T4 --> C4[Core 4]
Fork-Join Pattern
#pragma omp parallel for
for (int i = 0; i < N; i++) {
process(array[i]);
}
sequenceDiagram
participant Main
participant T1 as Thread 1
participant T2 as Thread 2
participant T3 as Thread 3
participant T4 as Thread 4
Main->>T1: Fork
Main->>T2: Fork
Main->>T3: Fork
Main->>T4: Fork
T1->>T1: Work
T2->>T2: Work
T3->>T3: Work
T4->>T4: Work
T1->>Main: Join
T2->>Main: Join
T3->>Main: Join
T4->>Main: Join
Pipeline Pattern
graph LR
I[Input] --> S1[Stage 1<br/>Thread 1]
S1 --> S2[Stage 2<br/>Thread 2]
S2 --> S3[Stage 3<br/>Thread 3]
S3 --> O[Output]
Example: video encoding
- Stage 1: Read frame
- Stage 2: Encode
- Stage 3: Write output
Hyper-Threading Details
Intel Hyper-Threading (SMT)
Replicated Resources
Separate for each thread:
- Program Counter
- Registers
- Flags
Shared Resources
Shared between threads:
- Execution units (ALU, FPU)
- Caches (L1, L2)
- TLB
graph TB
subgraph "HT-Enabled Core"
subgraph "Thread 1"
PC1[PC]
Reg1[Registers]
end
subgraph "Thread 2"
PC2[PC]
Reg2[Registers]
end
subgraph "Shared"
ALU[ALUs]
FPU[FPU]
Cache[L1/L2 Cache]
end
PC1 --> ALU
PC2 --> ALU
Reg1 --> FPU
Reg2 --> FPU
end
Performance Impact
Workload A (compute-heavy):
Single-thread: 100%
HT (2 threads): 130%
Workload B (memory-heavy):
Single-thread: 100%
HT (2 threads): 170%
While one thread waits on a cache miss, the other keeps working!
Parallelizm Challenges
1. Synchronization Overhead
// High overhead - lock per iteration
for (int i = 0; i < N; i++) {
lock();
shared_counter++;
unlock();
}
// Low overhead - local accumulation
int local = 0;
for (int i = 0; i < N; i++) {
local++;
}
lock();
shared_counter += local;
unlock();
2. Load Balancing
graph TB
subgraph "Poor Balance"
T1[Thread 1<br/>80% work]
T2[Thread 2<br/>10% work]
T3[Thread 3<br/>10% work]
end
subgraph "Good Balance"
T4[Thread 1<br/>33% work]
T5[Thread 2<br/>33% work]
T6[Thread 3<br/>34% work]
end
3. False Sharing
struct Counter {
int count1; // Thread 1
int count2; // Thread 2
}; // Same cache line!
// Solution: padding to a full 64-byte cache line
struct Counter {
int count1;
char pad[60]; // 4-byte int + 60 bytes of padding = 64-byte line
int count2;
};
4. Memory Bandwidth
Single core: 20 GB/s
4 cores: 40 GB/s (not 80!)
8 cores: 60 GB/s
Memory bandwidth limit!
Parallelism Patterns
Embarrassingly Parallel
No communication between tasks is needed.
// Perfect parallelism
#pragma omp parallel for
for (int i = 0; i < N; i++) {
output[i] = expensive_computation(input[i]);
}
Scalability: Near-linear
Reduction
Combine results.
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++) {
sum += array[i];
}
Producer-Consumer
Queue<Task> queue;
// Producer thread
while (has_work) {
queue.push(create_task());
}
// Consumer threads
while (true) {
Task t = queue.pop();
process(t);
}
Performance Metrics
Speedup
Speedup = T(1) / T(N)
where T(N) is the execution time with N processors
Efficiency
Efficiency = Speedup / N
Ideal: 1.0 (100%)
Typical: 0.7-0.9 (70-90%)
Scalability
graph LR
N[Cores] --> S[Speedup]
1 --> 1x[1.0x]
2 --> 2x[1.8x]
4 --> 4x[3.2x]
8 --> 8x[5.5x]
16 --> 16x[9.0x]
Heterogeneous Computing
Different types of processors working together.
graph TB
App[Application]
App --> CPU[CPU<br/>Serial + Control]
App --> GPU[GPU<br/>Parallel Compute]
App --> FPGA[FPGA<br/>Custom Logic]
App --> DSP[DSP<br/>Signal Processing]
CPU --> Mem[Unified Memory]
GPU --> Mem
FPGA --> Mem
DSP --> Mem
Example: Apple M1/M2
- CPU cores (performance + efficiency)
- GPU cores
- Neural Engine
- Video encode/decode
Best Practices
- Profile first
  - Find the parallel portion
  - Identify the bottlenecks
- Coarse-grained parallelism
  - Large tasks
  - Less synchronization
- Minimize communication
  - Local data
  - Batch operations
- Use libraries
  - OpenMP, TBB, PPL
  - BLAS, cuBLAS
- Test scaling
  - 1, 2, 4, 8... cores
  - Measure the speedup
Related Topics
- CPU Architecture: Superscalar and out-of-order
- Multiprocessor Systems: Multi-core architecture
- Synchronization: Lock-free algorithms
- Performance: Amdahl's Law and scaling
- Cache Memory: False sharing