Modern CPU Architectures

Architecture Overview

graph TB
    Architectures[Modern CPU Architectures] --> CISC[CISC<br/>Complex Instruction Set]
    Architectures --> RISC[RISC<br/>Reduced Instruction Set]
    
    CISC --> x86[x86-64<br/>Intel, AMD]
    
    RISC --> ARM[ARM<br/>Mobile, Server]
    RISC --> RISCV[RISC-V<br/>Open source]
    RISC --> MIPS[MIPS<br/>Embedded]
    RISC --> PowerPC[PowerPC<br/>Legacy]
    
    x86 --> Desktop[Desktop/Server<br/>Dominant]
    ARM --> Mobile[Mobile<br/>Dominant]
    RISCV --> Emerging[Emerging<br/>IoT, Custom]

x86-64 Architecture

History

timeline
    title x86 Evolution
    1978 : 8086 (16-bit)
    1985 : 80386 (32-bit)
    2003 : AMD64 / x86-64 (64-bit)
    2008 : Nehalem (Intel - integrated memory controller)
    2011 : Sandy Bridge (AVX, ring bus)
    2017 : Ryzen (AMD - Zen architecture)
    2020 : Zen 3 (AMD) / Tiger Lake (Intel)
    2022 : Zen 4 (AMD) / Raptor Lake (Intel)

x86-64 Features

CISC philosophy:

Variable-length instructions (1-15 bytes)
Complex addressing modes
Instructions decoded into RISC-like micro-ops (microcode handles the complex ones)
Backward compatibility (since 1978)

Registers:

General Purpose (64-bit):
RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP
R8, R9, R10, R11, R12, R13, R14, R15

SIMD (Vector):
XMM0-XMM15 (128-bit) - SSE
YMM0-YMM15 (256-bit) - AVX
ZMM0-ZMM31 (512-bit) - AVX-512

Segment Registers:
CS, DS, SS, ES, FS, GS (mostly legacy; FS and GS still hold thread-local and per-CPU base addresses)

Special:
RIP (Instruction Pointer)
RFLAGS (Status flags)

Memory Model:

Canonical addresses (48 bits used with 4-level paging):
User space:   0x0000000000000000 - 0x00007FFFFFFFFFFF
Kernel space: 0xFFFF800000000000 - 0xFFFFFFFFFFFFFFFF

4-level page table (5-level with 57-bit addresses on newer CPUs)
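
The canonical-address rule and the 4-level page walk can be sketched in a few lines. This is a minimal illustration, assuming 48-bit virtual addresses, 4 KiB pages, and 9 index bits per table level; the names `is_canonical` and `split_indices` are hypothetical helpers, not part of any OS or CPU API.

```python
VA_BITS = 48      # virtual-address bits in use (assumed: 4-level paging)
PAGE_SHIFT = 12   # 4 KiB pages (assumed)
INDEX_BITS = 9    # 512 entries per page-table level


def is_canonical(addr: int, va_bits: int = VA_BITS) -> bool:
    """True if bit va_bits-1 and all bits above it are equal (sign-extended)."""
    top = addr >> (va_bits - 1)                 # bit 47 and everything above it
    all_ones = (1 << (64 - va_bits + 1)) - 1    # 17 one-bits for va_bits = 48
    return top == 0 or top == all_ones


def split_indices(addr: int) -> dict:
    """Split a virtual address into four table indices plus the page offset."""
    mask = (1 << INDEX_BITS) - 1
    return {
        "pml4": (addr >> 39) & mask,
        "pdpt": (addr >> 30) & mask,
        "pd": (addr >> 21) & mask,
        "pt": (addr >> 12) & mask,
        "offset": addr & ((1 << PAGE_SHIFT) - 1),
    }
```

With these assumptions, the first kernel-space address `0xFFFF800000000000` lands in top-level entry 256, which is why the upper half of the top-level table is commonly described as the kernel half.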

Intel vs AMD

Intel Microarchitecture (2023):

graph TB
    Core[Intel Core<br/>Raptor Lake] --> PCore[P-Cores<br/>Performance]
    Core --> ECore[E-Cores<br/>Efficiency]
    
    PCore --> RaptorCove[Raptor Cove<br/>Microarchitecture]
    ECore --> Gracemont[Gracemont<br/>Microarchitecture]
    
    RaptorCove --> Wide[Wide execution<br/>6-wide decode]
    Gracemont --> Narrow[Narrow execution<br/>Efficient]

AMD Microarchitecture (2023):

graph TB
    Zen4[AMD Zen 4] --> Core[CPU Core]
    Zen4 --> L3[L3 Cache<br/>32MB per CCD]
    Zen4 --> IOD[I/O Die<br/>Memory, PCIe]
    
    Core --> Frontend[Front-end<br/>4-wide decode]
    Core --> Backend[Back-end<br/>10 execution ports]
    Core --> L1[L1: 32KB I + 32KB D]
    Core --> L2[L2: 1MB per core]

Performance Comparison (2023)

Feature	Intel Core i9-13900K	AMD Ryzen 9 7950X
Architecture	Raptor Lake	Zen 4
Cores	24 (8P + 16E)	16 (all equal)
Threads	32	32
Base Clock	3.0 GHz (P) / 2.2 GHz (E)	4.5 GHz
Boost Clock	5.8 GHz	5.7 GHz
L3 Cache	36 MB	64 MB
TDP	125W (PL1) / 253W (PL2)	170W
Process	Intel 7 (10nm)	TSMC 5nm
PCIe	Gen 5.0 (16 lanes)	Gen 5.0 (24 lanes)
DDR Support	DDR4/DDR5	DDR5 only

Use case:

Intel: Typically stronger in lightly threaded work and gaming
AMD: Typically stronger in sustained multi-threaded, productivity work

ARM Architecture

ARM Features

RISC philosophy:

Fixed-length instructions (32-bit)
Load/Store architecture
Simple addressing modes
Energy efficient

graph TB
    ARM[ARM Architecture] --> ISA[Instruction Set]
    ARM --> Power[Power Efficiency]
    ARM --> Scale[Scalability]
    
    ISA --> ARM32[ARMv7<br/>32-bit<br/>Legacy]
    ISA --> ARM64[ARMv8/ARMv9<br/>64-bit<br/>Modern]
    
    Power --> Low[Low power<br/>Mobile, IoT]
    Power --> High[High performance<br/>Server, Desktop]
    
    Scale --> Cortex[Cortex Series]
    Cortex --> A[Cortex-A<br/>Application]
    Cortex --> R[Cortex-R<br/>Real-time]
    Cortex --> M[Cortex-M<br/>Microcontroller]

ARMv8/ARMv9 Registers

General Purpose (64-bit):
X0-X30 (64-bit) or W0-W30 (32-bit lower half)
X30 = Link Register (LR)
XZR = Zero Register

Special:
SP  = Stack Pointer
PC  = Program Counter (not a general-purpose register in AArch64)

SIMD/FP:
V0-V31 (128-bit) - NEON
Can be accessed as:
- Q0-Q31 (128-bit)
- D0-D31 (64-bit)
- S0-S31 (32-bit)
- H0-H31 (16-bit)
- B0-B31 (8-bit)
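
The narrower names are views onto the low-order bits of the same 128-bit register. A small sketch models that aliasing with a Python integer; `view` is a hypothetical helper and the register contents are arbitrary.

```python
# Width in bits of each AArch64 SIMD/FP register view
VIEW_BITS = {"Q": 128, "D": 64, "S": 32, "H": 16, "B": 8}


def view(v_value: int, name: str) -> int:
    """Return the low-order bits of a 128-bit V register as seen through a view."""
    return v_value & ((1 << VIEW_BITS[name]) - 1)


v0 = 0x00112233445566778899AABBCCDDEEFF   # arbitrary 128-bit contents

s0 = view(v0, "S")   # low 32 bits of the same register
d0 = view(v0, "D")   # low 64 bits of the same register
```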

ARM Instruction Example

// ARM64 assembly
add x0, x1, x2        // x0 = x1 + x2
ldr x0, [x1, #8]      // Load from memory: x0 = *(x1 + 8)
str x0, [x1, #8]      // Store to memory: *(x1 + 8) = x0
cmp x0, x1            // Compare x0 and x1
b.eq label            // Branch if equal
ret                   // Return (jump to LR)

Load/Store architecture:

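// Cannot do: add x0, [x3], x1   (arithmetic takes no memory operands)
// Must do:
ldr x2, [x3]          // Load the operand from the address in x3
add x0, x2, x1        // Compute in registers
str x0, [x4]          // Store the result to the address in x4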

ARM Ecosystem

graph TB
    ARM[ARM Ltd] -->|License| Vendors[Chip Vendors]
    
    Vendors --> Qualcomm[Qualcomm<br/>Snapdragon]
    Vendors --> Apple[Apple<br/>A-series, M-series]
    Vendors --> Samsung[Samsung<br/>Exynos]
    Vendors --> Mediatek[MediaTek<br/>Dimensity]
    Vendors --> Amazon[Amazon<br/>Graviton]
    Vendors --> Ampere[Ampere<br/>Altra]
    
    Qualcomm --> Mobile1[Mobile phones]
    Apple --> Mobile2[iPhone, iPad, Mac]
    Samsung --> Mobile3[Galaxy phones]
    Amazon --> Server1[AWS EC2]
    Ampere --> Server2[Cloud servers]

ARM Server CPUs (2023)

CPU	Vendor	Cores	Clock	TDP	Use Case
Graviton3	Amazon	64	2.6 GHz	~100W	AWS cloud
Altra Max	Ampere	128	3.0 GHz	250W	Cloud/HPC
Neoverse V2	ARM (ref)	Scalable	-	-	Reference design

Apple Silicon

M1/M2/M3 Architecture

graph TB
    M1[Apple Silicon<br/>M1/M2/M3] --> SoC[System on Chip]
    
    SoC --> CPU[CPU Cluster]
    SoC --> GPU[Integrated GPU]
    SoC --> Neural[Neural Engine]
    SoC --> Memory[Unified Memory]
    SoC --> Media[Media Engines]
    SoC --> SecureEnclave[Secure Enclave]
    
    CPU --> Perf[Performance Cores<br/>Firestorm/Avalanche]
    CPU --> Eff[Efficiency Cores<br/>Icestorm/Blizzard]
    
    Memory --> LPDDR[LPDDR4X/LPDDR5<br/>High bandwidth<br/>Low latency]

Apple M-series Comparison

Model	M1	M2	M3	M1 Ultra
Launch	2020	2022	2023	2022
Process	TSMC 5nm	TSMC 5nm	TSMC 3nm	TSMC 5nm (2×M1 Max)
P-Cores	4	4	4	16
E-Cores	4	4	4	4
GPU Cores	7-8	8-10	8-10	48-64
Neural Engine	16-core	16-core	16-core	32-core
Memory	8-16 GB	8-24 GB	8-24 GB	64-128 GB
Bandwidth	68 GB/s	100 GB/s	100 GB/s	800 GB/s
TDP	~15W	~20W	~20W	~60W

Unified Memory Architecture

graph LR
    CPU[CPU] <--> Memory[Unified Memory<br/>LPDDR4X/LPDDR5<br/>Shared pool]
    GPU[GPU] <--> Memory
    Neural[Neural Engine] <--> Memory
    Media[Media Engines] <--> Memory
    
    Memory -->|High Bandwidth| Fast[68-800 GB/s<br/>depending on model]

Advantages:

Zero-copy between CPU/GPU
Lower latency
Better power efficiency
Simpler programming model

Disadvantages:

Not upgradeable
Shared bandwidth

M1 Performance

Single-thread:

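Comparable to contemporary Intel Core i9 / AMD Ryzen 9 parts in single-thread benchmarks
Much lower power (roughly 5W for one loaded core vs 125W+ package TDP, so not a like-for-like figure)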

Efficiency:

~2-3× performance per watt vs contemporary x86 laptop CPUs (workload-dependent)

GPU:

Integrated GPU competitive with mid-range discrete GPUs
Excellent for content creation (video encoding/decoding)

RISC-V

RISC-V Features

Open-source ISA:

Free to use, no licensing fees
Modular design
Extensible
Simple and elegant

graph TB
    RISCV[RISC-V ISA] --> Base[Base ISA]
    RISCV --> Extensions[Extensions]
    
    Base --> RV32[RV32I<br/>32-bit integer]
    Base --> RV64[RV64I<br/>64-bit integer]
    Base --> RV128[RV128I<br/>128-bit<br/>Future]
    
    Extensions --> M[M: Multiply/Divide]
    Extensions --> A[A: Atomic]
    Extensions --> F[F: Single FP]
    Extensions --> D[D: Double FP]
    Extensions --> C[C: Compressed<br/>16-bit instructions]
    Extensions --> V[V: Vector]

RISC-V Registers

Integer Registers:
x0 (zero) - Hardwired to 0
x1 (ra)   - Return address
x2 (sp)   - Stack pointer
x3 (gp)   - Global pointer
x4 (tp)   - Thread pointer
x5-x7, x28-x31 (t0-t6) - Temporaries
x8-x9, x18-x27 (s0-s11) - Saved registers
x10-x17 (a0-a7) - Function arguments/return values

Floating-Point Registers:
f0-f31

Vector Registers (V extension):
v0-v31
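
The integer register table above can be expressed as a lookup from register number to ABI name. This is a small sketch; `abi_name` is a hypothetical helper written for this page, not a function from any RISC-V toolchain.

```python
def abi_name(x: int) -> str:
    """Map integer register number 0..31 (x0..x31) to its ABI name."""
    if not 0 <= x <= 31:
        raise ValueError("RISC-V has integer registers x0..x31")
    fixed = {0: "zero", 1: "ra", 2: "sp", 3: "gp", 4: "tp"}
    if x in fixed:
        return fixed[x]
    if 5 <= x <= 7:
        return f"t{x - 5}"     # x5-x7   -> t0-t2
    if 8 <= x <= 9:
        return f"s{x - 8}"     # x8-x9   -> s0-s1
    if 10 <= x <= 17:
        return f"a{x - 10}"    # x10-x17 -> a0-a7
    if 18 <= x <= 27:
        return f"s{x - 16}"    # x18-x27 -> s2-s11
    return f"t{x - 25}"        # x28-x31 -> t3-t6


names = [abi_name(i) for i in range(32)]   # full x0..x31 listing
```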

RISC-V Instruction Format

R-type (Register):
[funct7 | rs2 | rs1 | funct3 | rd | opcode]
 7 bits   5     5      3       5     7

Example: add x1, x2, x3  // x1 = x2 + x3

I-type (Immediate):
[immediate | rs1 | funct3 | rd | opcode]
  12 bits    5      3       5     7

Example: addi x1, x2, 100  // x1 = x2 + 100

Load: lw x1, 8(x2)  // x1 = *(x2 + 8)
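
To make the bit layouts concrete, the sketch below packs the three example instructions into 32-bit words. The helper names `encode_r` and `encode_i` are hypothetical; the opcode and funct values are the RV32I ones for `add`, `addi` and `lw`.

```python
OP = 0x33       # opcode for register-register ALU instructions (add, sub, ...)
OP_IMM = 0x13   # opcode for register-immediate ALU instructions (addi, ...)
LOAD = 0x03     # opcode for loads (lw uses funct3 = 0b010)


def encode_r(funct7: int, rs2: int, rs1: int, funct3: int, rd: int, opcode: int) -> int:
    """Pack an R-type instruction: funct7 | rs2 | rs1 | funct3 | rd | opcode."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


def encode_i(imm: int, rs1: int, funct3: int, rd: int, opcode: int) -> int:
    """Pack an I-type instruction: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode


add_word = encode_r(0, 3, 2, 0, 1, OP)         # add  x1, x2, x3
addi_word = encode_i(100, 2, 0, 1, OP_IMM)     # addi x1, x2, 100
lw_word = encode_i(8, 2, 0b010, 1, LOAD)       # lw   x1, 8(x2)
```

Masking the immediate with `0xFFF` keeps negative offsets in two's-complement form within the 12-bit field.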

RISC-V Ecosystem

graph TB
    RISCV[RISC-V] --> SiFive[SiFive<br/>Commercial cores]
    RISCV --> Alibaba[Alibaba<br/>T-Head XuanTie]
    RISCV --> WD[Western Digital<br/>SweRV]
    RISCV --> OpenHW[OpenHW Group<br/>Open-source cores]
    RISCV --> Google[Google<br/>Android support]
    
    SiFive --> Embedded1[Embedded, IoT]
    Alibaba --> Cloud[Cloud, Edge]
    WD --> Storage[Storage controllers]
    OpenHW --> Research[Research, Education]

Use cases:

Embedded systems
IoT devices
Custom accelerators
Research and education
Future: Desktop/Server (emerging)

GPU Architecture Basics

GPU vs CPU

graph TB
    subgraph CPU
        CPU_Core1[Core 1<br/>Complex<br/>Low latency]
        CPU_Core2[Core 2]
        CPU_Core3[Core ...]
        CPU_Core4[Core 8]
        CPU_Cache[Large Cache]
    end
    
    subgraph GPU
        GPU_SM1[SM 1<br/>Simple<br/>High throughput]
        GPU_SM2[SM 2]
        GPU_SM3[SM ...]
        GPU_SM4[SM 100+]
        GPU_Cores[1000s of cores]
    end

Aspect	CPU	GPU
Design	Few complex cores	Many simple cores
Threads	10s-100s	1000s-10000s
Latency	Optimized for low latency	High latency tolerated
Cache	Large (MB)	Small (KB per core)
Control Flow	Good branch prediction	SIMT (threads diverge = slow)
Use Case	General purpose	Parallel workloads

CUDA Architecture (NVIDIA)

graph TB
    GPU[GPU] --> SM[Streaming Multiprocessor<br/>SM]
    
    SM --> Cores[CUDA Cores<br/>64-128 per SM]
    SM --> Tensor[Tensor Cores<br/>AI/ML]
    SM --> RT[RT Cores<br/>Ray tracing]
    SM --> SharedMem[Shared Memory<br/>~100 KB]
    SM --> Registers[Register File<br/>64K 32-bit registers]
    
    GPU --> GlobalMem[Global Memory<br/>GDDR6/HBM<br/>GBs]

Execution Model:

Grid (entire GPU kernel)
├── Block 1 (executed on 1 SM)
│   ├── Warp 1 (32 threads, lockstep)
│   ├── Warp 2
│   └── ...
├── Block 2
└── ...

Warp:

32 threads execute together (SIMT)
Same instruction, different data
Branch divergence → serialize
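
A toy cost model shows why divergence hurts. It assumes a warp pays for every branch path that at least one of its threads takes, with a fixed cycle cost per path; real hardware is more nuanced, and `warp_cost` is a hypothetical helper.

```python
WARP_SIZE = 32


def warp_cost(took_branch: list, then_cycles: int, else_cycles: int) -> int:
    """Cycles for one warp to execute an if/else under a simple SIMT model.

    took_branch holds one bool per thread. Each path that any thread takes
    is executed once, with the other threads masked off.
    """
    if len(took_branch) != WARP_SIZE:
        raise ValueError("a warp has 32 threads")
    cost = 0
    if any(took_branch):        # at least one thread runs the 'then' path
        cost += then_cycles
    if not all(took_branch):    # at least one thread runs the 'else' path
        cost += else_cycles
    return cost


uniform = warp_cost([True] * WARP_SIZE, 10, 20)                           # one path only
divergent = warp_cost([i % 2 == 0 for i in range(WARP_SIZE)], 10, 20)     # both paths
```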

GPU Programming Example

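// CUDA kernel: each thread adds one pair of elements
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard: the last block may overshoot n
        c[i] = a[i] + b[i];
    }
}

// Host code
int main() {
    int n = 1000000;
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;  // round up

    float *d_a, *d_b, *d_c;
    cudaMallocManaged(&d_a, n * sizeof(float));  // managed memory keeps the sketch short
    cudaMallocManaged(&d_b, n * sizeof(float));
    cudaMallocManaged(&d_c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { d_a[i] = 1.0f; d_b[i] = 2.0f; }

    vectorAdd<<<numBlocks, blockSize>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();  // wait for the kernel before using d_c

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}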
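
The launch arithmetic in the host code can be checked outside CUDA. The sketch below reproduces the ceiling division and the per-thread index formula in plain Python; `launch_config` and `global_index` are hypothetical helpers.

```python
def launch_config(n: int, block_size: int) -> tuple:
    """Return (num_blocks, total_threads, idle_threads) for a 1-D launch."""
    num_blocks = (n + block_size - 1) // block_size   # ceiling division
    total = num_blocks * block_size
    return num_blocks, total, total - n


def global_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Same formula as blockIdx.x * blockDim.x + threadIdx.x in the kernel."""
    return block_idx * block_dim + thread_idx


blocks, threads, idle = launch_config(1_000_000, 256)
last_valid = global_index(blocks - 1, 256, 63)   # last thread that passes the i < n guard
```

The idle threads in the final block are the reason the kernel needs the `if (i < n)` guard.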

Memory Hierarchy (approximate latencies, vary by GPU generation):

Registers:        ~1 cycle   (per-thread)
Shared Memory:    ~20-30 cycles  (per-block)
L1 Cache:         ~30 cycles
L2 Cache:         ~200 cycles
Global Memory:    ~400-800 cycles

Modern GPU Comparison (2023)

GPU	Vendor	Architecture	Cores	Memory	Bandwidth	TDP	Use Case
RTX 4090	NVIDIA	Ada Lovelace	16384 CUDA	24GB GDDR6X	1008 GB/s	450W	Gaming, AI
RX 7900 XTX	AMD	RDNA 3	6144 Stream (dual-issue)	24GB GDDR6	960 GB/s	355W	Gaming
H100	NVIDIA	Hopper	16896 CUDA	80GB HBM3	3350 GB/s	700W	Data center AI
A100	NVIDIA	Ampere	6912 CUDA	40-80GB HBM2e	1555-2039 GB/s	400W	Data center
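
The bandwidth column follows from bus width and per-pin data rate. The sketch below shows the arithmetic; the bus widths and data rates used in the example (384-bit at 21 Gbit/s for the RTX 4090, 384-bit at 20 Gbit/s for the RX 7900 XTX) are not in the table and are quoted here as assumptions from the published specifications.

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s: bus width (bits) * per-pin rate (Gbit/s) / 8."""
    return bus_width_bits * data_rate_gbps / 8


rtx_4090 = peak_bandwidth_gbs(384, 21)      # assumed: 384-bit GDDR6X at 21 Gbit/s
rx_7900_xtx = peak_bandwidth_gbs(384, 20)   # assumed: 384-bit GDDR6 at 20 Gbit/s
```

These are theoretical peaks; sustained bandwidth in real kernels is lower.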

Architecture Comparison

Instruction Set

ISA	Type	Complexity	Compatibility	Power	Performance
x86-64	CISC	High	Excellent (40+ years)	Moderate	Excellent
ARM	RISC	Low-Medium	Good	Excellent	Good-Excellent
RISC-V	RISC	Low	Emerging	Excellent	Good

Market Share (2023)

pie title Desktop/Laptop
    "x86-64 (Intel/AMD)" : 85
    "ARM (Apple)" : 10
    "Other" : 5

pie title Mobile/Tablet
    "ARM" : 99
    "Other" : 1

pie title Server
    "x86-64" : 90
    "ARM" : 8
    "Other" : 2

Performance per Watt

graph LR
    Low[Low Performance/Watt] --> x86[x86-64<br/>High power<br/>High perf]
    x86 --> ARM[ARM<br/>Balanced]
    ARM --> AppleSilicon[Apple Silicon<br/>Excellent efficiency]
    AppleSilicon --> High[High Performance/Watt]

Heterogeneous Computing

big.LITTLE (ARM)

graph TB
    Scheduler[OS Scheduler] --> Big[Big Cores<br/>High performance<br/>High power]
    Scheduler --> Little[Little Cores<br/>Low performance<br/>Low power]
    
    Big --> Heavy[Heavy workloads<br/>Gaming, Video]
    Little --> Light[Light workloads<br/>Background, UI]

DynamIQ:

More flexible than big.LITTLE
Mix different core types in same cluster
Better migration

Intel Hybrid (P/E Cores)

graph TB
    Thread[Thread Director<br/>Hardware hints to OS scheduler] --> PCore[P-Cores<br/>Performance]
    Thread --> ECore[E-Cores<br/>Efficiency]
    
    PCore --> FG[Foreground tasks<br/>Games, Apps]
    ECore --> BG[Background tasks<br/>OS, Services]

Alder Lake onwards (2021+):

P-cores: High performance, out-of-order, SMT
E-cores: High efficiency, simpler, no SMT
Thread Director: Hardware hints to OS
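
Because only the P-cores support SMT, the logical thread count of a hybrid part is not simply twice the core count. A one-line sketch, assuming 2-way SMT on P-cores; `logical_threads` is a hypothetical helper.

```python
def logical_threads(p_cores: int, e_cores: int, smt_ways: int = 2) -> int:
    """P-cores run smt_ways hardware threads each; E-cores run one each."""
    return p_cores * smt_ways + e_cores


i9_13900k = logical_threads(p_cores=8, e_cores=16)   # 8P + 16E configuration
```

For the 8P + 16E configuration this gives 32 threads from 24 cores, matching the comparison table earlier on the page.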

Future Trends

1. Chiplet Design

graph TB
    Package[Package] --> Core1[Core Chiplet 1]
    Package --> Core2[Core Chiplet 2]
    Package --> IOD[I/O Die]
    Package --> HBM[HBM Memory]
    
    Core1 <--> IOD
    Core2 <--> IOD
    IOD <--> HBM

Benefits:

Better yields (small dies)
Mix-and-match components
Scalability
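
The yield benefit can be illustrated with the simple Poisson yield model, where the fraction of defect-free dies falls exponentially with die area. The defect density and die areas below are hypothetical numbers chosen for illustration, not figures for any real process or product.

```python
import math


def die_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson yield model: expected fraction of dies with zero defects."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


D0 = 0.1                              # hypothetical defect density, defects per cm^2
monolithic = die_yield(600, D0)       # one large 600 mm^2 die
chiplet = die_yield(75, D0)           # one of eight 75 mm^2 chiplets
```

Under these assumptions roughly 55% of the large dies are defect-free versus roughly 93% of the small ones, and a defect costs one small chiplet rather than the whole large die.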

Examples:

AMD Ryzen/EPYC (Zen 2+)
Intel Sapphire Rapids

2. 3D Stacking

graph TB
    Top[Cache Die<br/>SRAM] ---|3D V-Cache, TSV| Middle[Compute Die<br/>CPU cores]
    Middle ---|On-package link| Bottom[I/O Die<br/>Memory controller]

AMD 3D V-Cache:

Stack L3 cache on top of cores
96MB L3 (Ryzen 7 5800X3D)
Gaming gains vary widely by title (AMD cited about 15% on average; cache-sensitive games gain more)

3. Custom Silicon

graph LR
    Google[Google<br/>TPU] --> ML1[Machine Learning]
    Amazon[Amazon<br/>Graviton, Trainium] --> Cloud[Cloud/AI]
    Tesla[Tesla<br/>FSD Chip] --> Auto[Autonomous Driving]
    Apple[Apple<br/>M-series] --> Consumer[Consumer devices]

4. Open Source Hardware

RISC-V adoption growing
OpenPOWER
Open-source GPU initiatives (e.g., Nyuzi)

5. Quantum Computing

Classical bit: 0 or 1
Qubit: Superposition of 0 and 1

Not replacement for classical, but for specific problems:
- Cryptography
- Optimization
- Simulation

Best Practices

1. Architecture Selection

x86-64 if:

Need maximum single-thread performance
Software compatibility critical
Desktop/gaming

ARM if:

Power efficiency important
Mobile/embedded
Modern software stack

RISC-V if:

Custom hardware
No licensing costs
Embedded/IoT

2. Cross-Platform Development

// Portable code: pick the SIMD header and vector type per target
#if defined(__x86_64__)
    #include <immintrin.h>      // x86 intrinsics
    typedef __m128 vec4f;       // 4 x float (SSE)
#elif defined(__aarch64__)
    #include <arm_neon.h>       // ARM NEON
    typedef float32x4_t vec4f;  // 4 x float (NEON)
#else
    #error "Unsupported architecture"
#endif
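
The same dispatch-by-architecture idea applies in scripts, for example when a build or packaging script has to choose a native library. The sketch below normalizes the string returned by `platform.machine()`; `isa_family` is a hypothetical helper and the list of machine strings is an assumption, not exhaustive.

```python
import platform


def isa_family(machine: str) -> str:
    """Normalize a platform.machine() string to an ISA family name."""
    m = machine.lower()
    if m in ("x86_64", "amd64"):
        return "x86-64"
    if m in ("aarch64", "arm64"):
        return "ARM64"
    if m.startswith("riscv"):
        return "RISC-V"
    return "unknown"


current = isa_family(platform.machine())   # family of the machine running this script
```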

3. Performance Tuning

// x86-64: Focus on cache, branch prediction
// ARM: Focus on power, data access patterns
// GPU: Focus on parallelism, memory coalescing

4. Profiling Tools

# x86
perf stat ./program
Intel VTune

# ARM
perf (Linux)
Instruments (Apple)
Streamline (ARM)

# GPU
Nsight Systems / Nsight Compute (NVIDIA)
nvprof (NVIDIA, legacy; superseded by the Nsight tools)

Related Topics

CPU Architecture: Core components
ISA: Instruction sets
Performance: Optimization techniques
Parallelism: Multi-core, SIMD
Power Management: Efficiency cores
Memory Hierarchy: Different architectures
