
Storage Architecture

Storage Hierarchy

graph TB
CPU[CPU Registers<br/>~1 ns<br/>Bytes] --> L1[L1 Cache<br/>~1 ns<br/>KB]
L1 --> L2[L2 Cache<br/>~3 ns<br/>MB]
L2 --> L3[L3 Cache<br/>~10 ns<br/>MB]
L3 --> RAM[RAM<br/>~100 ns<br/>GB]
RAM --> SSD[SSD<br/>~100 µs<br/>TB]
SSD --> HDD[HDD<br/>~10 ms<br/>TB]
HDD --> Tape[Tape/Archive<br/>seconds<br/>PB]

style CPU fill:#f96
style L1 fill:#fc9
style L2 fill:#ff9
style L3 fill:#9f9
style RAM fill:#9cf
style SSD fill:#c9f
style HDD fill:#f9c
style Tape fill:#ccc

Trade-off: Speed ↔ Capacity ↔ Cost

Hard Disk Drive (HDD)

HDD Structure

graph TB
HDD[HDD Components] --> Platter[Platters<br/>Magnetic disks]
HDD --> Spindle[Spindle<br/>Rotation motor]
HDD --> Head[Read/Write Heads]
HDD --> Arm[Actuator Arm]
HDD --> Controller[Controller]

Platter --> Tracks[Tracks<br/>Concentric circles]
Tracks --> Sectors[Sectors<br/>512B or 4KB]
Tracks --> Cylinders[Cylinders<br/>Same track across platters]

Key parameters:

  • RPM (Rotations Per Minute): 5400, 7200, 10000, 15000
  • Capacity: 1TB - 20TB
  • Interface: SATA, SAS
  • Cache: 64MB - 256MB

HDD Performance

Access time = Seek time + Rotational latency + Transfer time

sequenceDiagram
participant Controller
participant Arm as Actuator Arm
participant Head as Read/Write Head
participant Platter

Controller->>Arm: Move to track
Note over Arm: Seek time (~5-10 ms)

Arm->>Head: Position head
Note over Platter: Wait for rotation
Note over Head: Rotational latency (~4-8 ms)

Head->>Platter: Read/Write
Note over Head: Transfer time (~0.1 ms)

Example calculation:

Seek time:           8 ms (average)
Rotational latency:  4.17 ms (7200 RPM → 60000/7200/2)
Transfer time:       0.04 ms (4 KB at 100 MB/s)
-----------------
Total:              ~12.2 ms per random access

Random IOPS: 1000 / 12.2 ≈ 80 IOPS
Sequential throughput: ~150-200 MB/s
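The arithmetic above can be wrapped in a couple of helpers. This is a minimal sketch using the same illustrative figures; function names such as `access_time_ms` are my own, not a real API.

```c
/* Rough HDD random-access model, matching the example above.
   Figures are illustrative, not measurements of a real drive. */

/* Average rotational latency in ms: half a revolution. */
static double rotational_latency_ms(int rpm) {
    return 60000.0 / rpm / 2.0;
}

/* Seek + rotation + transfer, all in ms.
   transfer_kb KB moved at mb_per_s MB/s (1 MB = 1024 KB here). */
static double access_time_ms(double seek_ms, int rpm,
                             double transfer_kb, double mb_per_s) {
    double transfer_ms = transfer_kb / (mb_per_s * 1024.0) * 1000.0;
    return seek_ms + rotational_latency_ms(rpm) + transfer_ms;
}

/* How many such random accesses fit into one second. */
static double random_iops(double access_ms) {
    return 1000.0 / access_ms;
}
```

With an 8 ms seek, 7200 RPM, and a 4 KB transfer at 100 MB/s this lands at roughly 12.2 ms per access, i.e. about 80 random IOPS.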

Disk Scheduling Algorithms

1. FCFS (First-Come, First-Served)

Request queue: 98, 183, 37, 122, 14, 124, 65, 67
Head position: 53

Order: 53 → 98 → 183 → 37 → 122 → 14 → 124 → 65 → 67
Total movement: 45+85+146+85+108+110+59+2 = 640 cylinders

2. SSTF (Shortest Seek Time First)

Head: 53
Order: 53 → 65 → 67 → 37 → 14 → 98 → 122 → 124 → 183
Total movement: 12+2+30+23+84+24+2+59 = 236 cylinders

Problem: Starvation (far requests may never be served)
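The two orderings above can be checked mechanically. A small sketch (function names are mine, and the SSTF helper assumes at most 64 pending requests):

```c
#include <limits.h>
#include <stdlib.h>

/* FCFS: serve requests in arrival order, summing head movement. */
static int fcfs_movement(int head, const int *req, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += abs(req[i] - head);
        head = req[i];
    }
    return total;
}

/* SSTF: repeatedly serve the pending request nearest the head. */
static int sstf_movement(int head, const int *req, int n) {
    int done[64] = {0};   /* assumes n <= 64 */
    int total = 0;
    for (int served = 0; served < n; served++) {
        int best = -1, best_dist = INT_MAX;
        for (int i = 0; i < n; i++) {
            int d = abs(req[i] - head);
            if (!done[i] && d < best_dist) { best = i; best_dist = d; }
        }
        done[best] = 1;
        total += best_dist;
        head = req[best];
    }
    return total;
}
```

For the queue 98, 183, 37, 122, 14, 124, 65, 67 with the head at 53, these return 640 and 236 cylinders, matching the traces above.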

3. SCAN (Elevator Algorithm)

Head: 53, Direction: Right
Order: 53 → 65 → 67 → 98 → 122 → 124 → 183 → [reverse] → 37 → 14
Total movement: (183-53) + (183-14) = 130 + 169 = 299 cylinders
(strict SCAN sweeps to the last cylinder before reversing — with a disk ending
at cylinder 199 that would be (199-53) + (199-14) = 331; turning at the
farthest request, as counted here, is really LOOK behavior)

4. C-SCAN (Circular SCAN)

Head: 53, Direction: Right
Order: 53 → 65 → 67 → 98 → 122 → 124 → 183 → [jump to 0] → 14 → 37
Total movement: 130 + 183 (jump) + 37 = 350 cylinders
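With the turnaround-at-farthest-request accounting used in these traces, both sweep totals collapse to one-line formulas. A sketch (names are mine):

```c
/* SCAN-style sweep: right to the farthest request, then back left
   to the nearest one (turning at requests, i.e. LOOK accounting). */
static int scan_movement(int head, int max_req, int min_req) {
    return (max_req - head) + (max_req - min_req);
}

/* C-SCAN-style sweep: right to the farthest request, a full jump
   back to cylinder 0, then right again to the last request below
   the starting position. */
static int cscan_movement(int head, int max_req, int last_low_req) {
    return (max_req - head) + max_req + last_low_req;
}
```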

5. LOOK / C-LOOK

Like SCAN, but the head only travels as far as the last request in each direction instead of all the way to the end of the disk.

graph LR
FCFS[FCFS<br/>Fair<br/>Poor performance]
SSTF[SSTF<br/>Best performance<br/>Starvation]
SCAN[SCAN<br/>No starvation<br/>Good performance]
CLOOK[C-LOOK<br/>Better than SCAN<br/>Common in practice]

Solid State Drive (SSD)

SSD Structure

graph TB
SSD[SSD] --> Controller[SSD Controller<br/>FTL firmware]
SSD --> DRAM[DRAM Cache<br/>Mapping table]
SSD --> Flash[NAND Flash Memory]

Flash --> Blocks[Blocks<br/>~256-512 KB]
Blocks --> Pages[Pages<br/>4-16 KB]

Controller --> FTL[Flash Translation Layer<br/>Logical → Physical]
Controller --> GC[Garbage Collection]
Controller --> WL[Wear Leveling]

NAND flash types:

| Type | Bits/Cell | Speed   | Endurance | Cost    | Use Case          |
|------|-----------|---------|-----------|---------|-------------------|
| SLC  | 1         | Fastest | ~100k P/E | Highest | Enterprise        |
| MLC  | 2         | Fast    | ~10k P/E  | High    | Consumer high-end |
| TLC  | 3         | Medium  | ~3k P/E   | Medium  | Consumer          |
| QLC  | 4         | Slow    | ~1k P/E   | Low     | Archive           |

P/E Cycles = Program/Erase cycles

Flash Translation Layer (FTL)

sequenceDiagram
participant OS
participant FTL
participant Flash

OS->>FTL: Write to LBA 100
FTL->>FTL: Check mapping table
FTL->>Flash: Write to Physical Block 532
FTL->>FTL: Update mapping: LBA 100 → PBA 532

Note over FTL: Old data (if exists) marked invalid

FTL responsibilities:

  • Logical-to-Physical mapping
  • Wear leveling
  • Garbage collection
  • Bad block management
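A toy page-level FTL illustrates the remap-on-write behaviour in the diagram. This is a sketch under heavy assumptions (one flat mapping table, a naive bump allocator, no GC or bad-block handling); all names are mine.

```c
#include <stdint.h>

#define NUM_PAGES 1024
#define UNMAPPED  UINT32_MAX

static uint32_t l2p[NUM_PAGES];   /* logical page -> physical page */
static uint8_t  valid[NUM_PAGES]; /* does this physical page hold live data? */
static uint32_t next_free;        /* naive log-structured allocator */

static void ftl_init(void) {
    for (uint32_t i = 0; i < NUM_PAGES; i++) { l2p[i] = UNMAPPED; valid[i] = 0; }
    next_free = 0;
}

/* Out-of-place write: allocate a fresh physical page, remap the LBA,
   and mark the previous copy invalid so GC can reclaim it later. */
static uint32_t ftl_write(uint32_t lba) {
    if (l2p[lba] != UNMAPPED)
        valid[l2p[lba]] = 0;      /* old data becomes garbage */
    uint32_t phys = next_free++;
    valid[phys] = 1;
    l2p[lba] = phys;
    return phys;
}

static uint32_t ftl_read(uint32_t lba) {
    return l2p[lba];              /* UNMAPPED if never written */
}
```

Rewriting the same LBA lands on a new physical page and invalidates the old one, which is exactly the "LBA 100 → PBA 532, old data marked invalid" flow in the sequence diagram above.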

Wear Leveling

Each NAND flash block can endure only a limited number of program/erase cycles.

graph TB
WL[Wear Leveling] --> Static[Static Wear Leveling]
WL --> Dynamic[Dynamic Wear Leveling]

Static --> S1[Rarely-written data<br/>moved to high-wear blocks]
Dynamic --> D1[Frequently-written data<br/>spread across blocks]

Example:

Block 0: 5000 P/E cycles
Block 1: 2000 P/E cycles
Block 2: 100 P/E cycles

FTL action: Write new data to Block 2 (lowest wear)
If Block 2 has static data → move to Block 0, use Block 2 for new writes
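The FTL decision in the example is simply "pick the block with the lowest erase count". A minimal sketch (function name is mine):

```c
/* Dynamic wear leveling: steer the next write to the block with the
   fewest program/erase cycles so wear evens out over time. */
static int least_worn_block(const int *pe_cycles, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (pe_cycles[i] < pe_cycles[best])
            best = i;
    return best;
}
```

With the counts above (5000, 2000, 100) it selects Block 2.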

Garbage Collection

sequenceDiagram
participant OS
participant SSD

Note over SSD: Block has valid & invalid pages

OS->>SSD: Write request
SSD->>SSD: No free pages!

Note over SSD: Trigger GC
SSD->>SSD: 1. Read valid pages
SSD->>SSD: 2. Write to new block
SSD->>SSD: 3. Erase old block
SSD->>SSD: 4. Now have free pages

SSD->>OS: Write complete

Write Amplification:

Write Amplification Factor (WAF) = Data written to flash / Data written by host

Example:
Host writes 1 GB
SSD writes 1 GB (new) + 500 MB (GC moves valid data)
WAF = 1.5 GB / 1 GB = 1.5
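The WAF formula is a one-liner; sketched here with the example's figures (units cancel, so any consistent unit works):

```c
/* Write Amplification Factor: total flash writes (host data plus the
   valid pages GC had to relocate) divided by host writes. */
static double waf(double host_writes, double gc_relocated) {
    return (host_writes + gc_relocated) / host_writes;
}
```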

Minimizing WAF:

  • Over-provisioning (extra capacity for GC)
  • TRIM command (OS tells SSD which blocks are free)
  • Reduce small random writes (batch and coalesce them)

TRIM Command

# Linux - manual TRIM
fstrim -v /

# Enable periodic TRIM
systemctl enable fstrim.timer

# Check if SSD supports TRIM
lsblk -D

sequenceDiagram
participant OS
participant FS as File System
participant SSD

OS->>FS: Delete file
FS->>FS: Mark blocks as free
FS->>SSD: TRIM command (LBAs)
SSD->>SSD: Mark physical blocks as invalid

Note over SSD: No need to preserve data<br/>Better GC performance

HDD vs SSD Comparison

| Feature             | HDD                    | SSD                   |
|---------------------|------------------------|-----------------------|
| **Speed**           |                        |                       |
| Sequential read     | 100-200 MB/s           | 500-7000 MB/s         |
| Sequential write    | 100-200 MB/s           | 500-7000 MB/s         |
| Random IOPS         | 80-160                 | 10k-1M                |
| Latency             | 5-10 ms                | 0.1 ms                |
| **Physical**        |                        |                       |
| Shock resistant     | No (moving parts)      | Yes                   |
| Noise               | Yes                    | No                    |
| Power               | 6-10 W                 | 2-5 W                 |
| Heat                | More                   | Less                  |
| **Cost & Capacity** |                        |                       |
| $/GB                | ~$0.02                 | ~$0.10                |
| Max capacity        | 20 TB                  | 8 TB (consumer)       |
| Lifespan            | 3-5 years              | 5-10 years            |
| **Use case**        | Large storage, archive | Boot drive, databases |

NVMe (Non-Volatile Memory Express)

NVMe vs SATA

graph TB
subgraph SATA
SATA_OS[OS] --> SATA_AHCI[AHCI Driver]
SATA_AHCI --> SATA_Controller[SATA Controller]
SATA_Controller --> SATA_SSD[SSD]

Note1[Single queue<br/>32 commands]
end

subgraph NVMe
NVMe_OS[OS] --> NVMe_Driver[NVMe Driver]
NVMe_Driver --> PCIe[PCIe]
PCIe --> NVMe_SSD[NVMe SSD]

Note2[64k queues<br/>64k commands each]
end
| Feature      | SATA          | NVMe                                         |
|--------------|---------------|----------------------------------------------|
| Interface    | SATA (6 Gbps) | PCIe 3.0 x4 (32 Gbps)                        |
| Max speed    | ~550 MB/s     | ~3500 MB/s (PCIe 3.0), ~7000 MB/s (PCIe 4.0) |
| Queue depth  | 32            | 64k                                          |
| Queues       | 1             | 64k                                          |
| Latency      | Higher        | Lower (less protocol overhead)               |
| CPU overhead | Higher        | Lower (direct PCIe)                          |

NVMe Architecture

graph TB
App[Application] --> Kernel[Kernel]
Kernel --> Driver[NVMe Driver]

Driver --> SQ[Submission Queue<br/>Commands to device]
Driver --> CQ[Completion Queue<br/>Results from device]

SQ --> Device[NVMe Device]
Device --> CQ

Device --> Flash[NAND Flash]
Device --> Controller[NVMe Controller]

Command flow:

  1. Driver writes command to Submission Queue
  2. Driver rings doorbell register (notify device)
  3. Device fetches command via DMA
  4. Device processes command
  5. Device writes completion entry to Completion Queue
  6. Device sends interrupt (or driver polls)
  7. Driver reads Completion Queue

NVMe Command Example

struct nvme_rw_command {
    uint8_t  opcode;      // READ or WRITE
    uint8_t  flags;
    uint16_t command_id;
    uint32_t nsid;        // Namespace ID
    uint64_t slba;        // Starting LBA
    uint16_t length;      // Number of blocks (0-based)
    // ... more fields
    uint64_t prp1;        // Physical Region Page 1
    uint64_t prp2;        // Physical Region Page 2
};

// Submit read command
void nvme_read(uint64_t lba, uint16_t blocks, void* buffer) {
    struct nvme_rw_command cmd = {0};
    cmd.opcode = NVME_CMD_READ;
    cmd.nsid   = 1;
    cmd.slba   = lba;
    cmd.length = blocks - 1;  // 0-based
    cmd.prp1   = virt_to_phys(buffer);

    // Write to submission queue
    submit_queue[sq_tail] = cmd;
    sq_tail = (sq_tail + 1) % queue_size;

    // Ring doorbell
    writel(sq_tail, doorbell_register);
}

NVMe Performance

PCIe 3.0 x4: ~4 GB/s theoretical, ~3.5 GB/s real
PCIe 4.0 x4: ~8 GB/s theoretical, ~7 GB/s real
PCIe 5.0 x4: ~16 GB/s theoretical, ~14 GB/s real

Random 4K IOPS: 500k - 1M
Latency: 10-20 µs

RAID (Redundant Array of Independent Disks)

RAID Levels

RAID 0 - Striping

graph LR
Data[Data: ABCDEFGH] --> Split{Split}
Split --> Disk1[Disk 1<br/>ACEG]
Split --> Disk2[Disk 2<br/>BDFH]
  • Capacity: N × disk_size
  • Performance: N × speed
  • Redundancy: None (any disk fails → data lost)
  • Use case: Performance, temporary data

RAID 1 - Mirroring

graph LR
Data[Data: ABCDEFGH] --> Mirror{Mirror}
Mirror --> Disk1[Disk 1<br/>ABCDEFGH]
Mirror --> Disk2[Disk 2<br/>ABCDEFGH]
  • Capacity: disk_size
  • Performance: Read: 2×, Write: 1×
  • Redundancy: 1 disk failure tolerated
  • Use case: Critical data

RAID 5 - Striping with Parity

graph LR
Data[Data: ABCDEF] --> Stripe{Stripe + Parity}
Stripe --> Disk1[Disk 1<br/>A, B, Parity_CD]
Stripe --> Disk2[Disk 2<br/>C, Parity_AB, E]
Stripe --> Disk3[Disk 3<br/>Parity_EF, D, F]
  • Capacity: (N-1) × disk_size
  • Performance: Read: fast, Write: slower (parity calc)
  • Redundancy: 1 disk failure
  • Use case: General purpose

Parity calculation:

A = 10110101
B = 11001010
Parity = A XOR B = 01111111

If A is lost:
A = B XOR Parity = 11001010 XOR 01111111 = 10110101
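The XOR arithmetic above, as code (0xB5 = 10110101, 0xCA = 11001010):

```c
#include <stdint.h>

/* RAID-5 parity over two data bytes; with more disks, XOR every
   data byte in the stripe. */
static uint8_t parity(uint8_t a, uint8_t b) {
    return a ^ b;
}

/* Rebuild a lost byte: XOR the surviving data with the parity.
   Works because a ^ b ^ b == a. */
static uint8_t rebuild(uint8_t survivor, uint8_t par) {
    return survivor ^ par;
}
```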

RAID 6 - Double Parity

graph LR
Data[Data: ABCD] --> DP{Double Parity}
DP --> Disk1[Disk 1<br/>A, P1, Q1]
DP --> Disk2[Disk 2<br/>B, P2, Q2]
DP --> Disk3[Disk 3<br/>C, P3, Q3]
DP --> Disk4[Disk 4<br/>D, P4, Q4]
  • Capacity: (N-2) × disk_size
  • Redundancy: 2 disk failures
  • Use case: High reliability

RAID 10 (1+0) - Mirrored Stripes

graph TB
Data[Data: ABCDEFGH] --> Stripe{Stripe}
Stripe --> S1[ACEG]
Stripe --> S2[BDFH]

S1 --> Mirror1{Mirror}
S2 --> Mirror2{Mirror}

Mirror1 --> D1[Disk 1<br/>ACEG]
Mirror1 --> D2[Disk 2<br/>ACEG]
Mirror2 --> D3[Disk 3<br/>BDFH]
Mirror2 --> D4[Disk 4<br/>BDFH]
  • Capacity: N/2 × disk_size
  • Performance: Excellent
  • Redundancy: 1 disk per mirror
  • Use case: High performance + reliability

RAID Comparison

| RAID | Capacity | Performance | Redundancy   | Min Disks |
|------|----------|-------------|--------------|-----------|
| 0    | 100%     | Excellent   | None         | 2         |
| 1    | 50%      | Good read   | 1 disk       | 2         |
| 5    | (N-1)/N  | Good        | 1 disk       | 3         |
| 6    | (N-2)/N  | Good        | 2 disks      | 4         |
| 10   | 50%      | Excellent   | 1 per mirror | 4         |
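The capacity column generalizes to a small helper. A sketch (it trusts the caller to respect each level's minimum disk count):

```c
/* Usable capacity for n equal disks of disk_size each. */
static double raid_capacity(int level, int n, double disk_size) {
    switch (level) {
    case 0:  return n * disk_size;        /* striping, no redundancy    */
    case 1:  return disk_size;            /* every disk is a mirror     */
    case 5:  return (n - 1) * disk_size;  /* one disk's worth of parity */
    case 6:  return (n - 2) * disk_size;  /* two disks' worth of parity */
    case 10: return (n / 2) * disk_size;  /* half the disks are mirrors */
    default: return 0.0;
    }
}
```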

Storage Interfaces

graph TB
Interfaces[Storage Interfaces] --> SATA[SATA<br/>6 Gbps<br/>~550 MB/s]
Interfaces --> SAS[SAS<br/>12 Gbps<br/>~1200 MB/s]
Interfaces --> NVMe[NVMe<br/>PCIe 4.0 x4<br/>~7000 MB/s]
Interfaces --> USB[USB<br/>USB 3.2: 20 Gbps]

SATA --> Consumer[Consumer HDDs/SSDs]
SAS --> Enterprise[Enterprise storage]
NVMe --> HighPerf[High-performance SSDs]
USB --> External[External drives]

Storage Performance Optimization

1. Alignment

# Check partition alignment
sudo parted /dev/sda align-check optimal 1

# Align to 1MB boundary (optimal for SSDs)
sudo parted /dev/sda mkpart primary 1MiB 100%

Result of misalignment:

Partition starts at 512B
SSD page size: 4KB

Write 4KB → spans 2 pages → 2 operations instead of 1
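Whether a write straddles a page boundary follows directly from its start offset. A sketch of the count (function name is mine):

```c
/* Number of flash pages an I/O touches, given byte offset, length,
   and page size. An aligned 4 KB write on 4 KB pages touches one
   page; the same write starting at offset 512 straddles two. */
static int pages_touched(long offset, long length, long page_size) {
    long first = offset / page_size;
    long last  = (offset + length - 1) / page_size;
    return (int)(last - first + 1);
}
```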

2. File System Selection

| FS    | Best for      | Features           |
|-------|---------------|--------------------|
| ext4  | General Linux | Mature, journaling |
| XFS   | Large files   | Good performance   |
| Btrfs | Snapshots     | CoW, compression   |
| F2FS  | SSDs          | Flash-friendly     |
| NTFS  | Windows       | Journaling         |
| APFS  | macOS         | SSD-optimized      |

3. I/O Scheduler

# Check current scheduler
cat /sys/block/sda/queue/scheduler

# Set scheduler
echo mq-deadline > /sys/block/sda/queue/scheduler

# Schedulers:
# - none: No scheduling (for NVMe)
# - mq-deadline: Good for SSD
# - bfq: Fair queuing (desktop)
# - kyber: Low latency

4. Read-Ahead

# Check read-ahead
sudo blockdev --getra /dev/sda

# Set read-ahead (in 512-byte sectors)
sudo blockdev --setra 256 /dev/sda # 128 KB

5. Queue Depth

# Check queue depth
cat /sys/block/nvme0n1/queue/nr_requests

# Increase for high-IOPS workloads
echo 1024 > /sys/block/nvme0n1/queue/nr_requests

Advanced Topics

1. Over-Provisioning

User-visible capacity: 240 GB
Physical capacity: 256 GB
Over-provisioning: 16 GB (6.25%)

Purpose:
- Reserve space for wear leveling
- Better GC performance
- Maintain performance over time

2. DRAM Cache

graph LR
Host[Host] --> DRAM[DRAM Cache<br/>Mapping table<br/>Write buffer]
DRAM --> Flash[NAND Flash]

DRAM -.Power loss.-> Lost[Data lost<br/>without capacitors]

DRAM-less SSDs:

  • Cheaper
  • Slower (mapping in HMB - Host Memory Buffer)
  • Less reliable

3. SLC Cache

graph TB
Write[Write Request] --> SLC[SLC Cache<br/>Fast TLC treated as SLC]

SLC -->|Cache full| Fold[Folding to TLC]
Fold --> TLC[TLC Storage<br/>Slower but more capacity]

Performance pattern:

Fast writes (SLC cache): 500 MB/s
Cache full, folding: 100 MB/s (slow!)

4. Write Coalescing

// Instead of:
write(fd, buffer, 4096); // 4K write
write(fd, buffer, 4096); // 4K write
write(fd, buffer, 4096); // 4K write
write(fd, buffer, 4096); // 4K write

// Coalesce:
write(fd, large_buffer, 16384); // 16K write (1 flash page)

Best Practices

  1. SSD:

    • Enable TRIM
    • Don't defragment
    • Disable hibernation/swap on consumer SSDs
    • Keep 10-20% free space
  2. HDD:

    • Regular defragmentation (Windows)
    • Avoid excessive head movement
    • Use for cold storage
  3. RAID:

    • Monitor disk health (SMART)
    • Replace failed disks immediately
    • Use hot spares
  4. Performance:

    • Align partitions
    • Choose right file system
    • Use appropriate I/O scheduler
    • Monitor with iostat, iotop
  5. Reliability:

    • Regular backups (3-2-1 rule)
    • Monitor disk temperature
    • Check SMART attributes

Monitoring Tools

# Disk usage
df -h
lsblk

# I/O statistics
iostat -x 1

# Disk activity
iotop

# SMART status
smartctl -a /dev/sda

# NVMe info
nvme list
nvme smart-log /dev/nvme0n1

# Benchmark
fio --name=test --rw=randread --bs=4k --size=1G

Related Topics

  • I/O Systems: DMA, interrupts
  • Memory Hierarchy: Caching strategies
  • File Systems: Storage management
  • Performance: I/O optimization
  • Reliability: RAID, backups