
Storage Architecture

Storage Hierarchy

graph TB
CPU[CPU Registers<br/>~1 ns<br/>Bytes] --> L1[L1 Cache<br/>~1 ns<br/>KB]
L1 --> L2[L2 Cache<br/>~3 ns<br/>MB]
L2 --> L3[L3 Cache<br/>~10 ns<br/>MB]
L3 --> RAM[RAM<br/>~100 ns<br/>GB]
RAM --> SSD[SSD<br/>~100 µs<br/>TB]
SSD --> HDD[HDD<br/>~10 ms<br/>TB]
HDD --> Tape[Tape/Archive<br/>seconds<br/>PB]

style CPU fill:#f96
style L1 fill:#fc9
style L2 fill:#ff9
style L3 fill:#9f9
style RAM fill:#9cf
style SSD fill:#c9f
style HDD fill:#f9c
style Tape fill:#ccc

Trade-off: Speed ↔ Capacity ↔ Cost

Hard Disk Drive (HDD)

HDD Structure

graph TB
HDD[HDD Components] --> Platter[Platters<br/>Magnetic disks]
HDD --> Spindle[Spindle<br/>Rotation motor]
HDD --> Head[Read/Write Heads]
HDD --> Arm[Actuator Arm]
HDD --> Controller[Controller]

Platter --> Tracks[Tracks<br/>Concentric circles]
Tracks --> Sectors[Sectors<br/>512B or 4KB]
Tracks --> Cylinders[Cylinders<br/>Same track across platters]

Key parameters:

  • RPM (Rotations Per Minute): 5400, 7200, 10000, 15000
  • Capacity: 1TB - 20TB
  • Interface: SATA, SAS
  • Cache: 64MB - 256MB

HDD Performance

Access time = Seek time + Rotational latency + Transfer time

sequenceDiagram
participant Controller
participant Arm as Actuator Arm
participant Head as Read/Write Head
participant Platter

Controller->>Arm: Move to track
Note over Arm: Seek time (~5-10 ms)

Arm->>Head: Position head
Note over Platter: Wait for rotation
Note over Head: Rotational latency (~4-8 ms)

Head->>Platter: Read/Write
Note over Head: Transfer time (~0.1 ms)

Example calculation:

Seek time:           8 ms (average)
Rotational latency:  4.17 ms (7200 RPM → 60000/7200/2)
Transfer time:       0.04 ms (4 KB at 100 MB/s)
-----------------
Total:              ~12.2 ms per random access

Random IOPS: 1000 / 12.2 ≈ 80 IOPS
Sequential throughput: ~150-200 MB/s
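The arithmetic above can be wrapped in a couple of helpers. This is a minimal sketch using the same illustrative figures; function names such as `access_time_ms` are my own, not a real API.

```c
/* Rough HDD random-access model, matching the example above.
   Figures are illustrative, not measurements of a real drive. */

/* Average rotational latency in ms: half a revolution. */
static double rotational_latency_ms(int rpm) {
    return 60000.0 / rpm / 2.0;
}

/* Seek + rotation + transfer, all in ms.
   transfer_kb KB moved at mb_per_s MB/s (1 MB = 1024 KB here). */
static double access_time_ms(double seek_ms, int rpm,
                             double transfer_kb, double mb_per_s) {
    double transfer_ms = transfer_kb / (mb_per_s * 1024.0) * 1000.0;
    return seek_ms + rotational_latency_ms(rpm) + transfer_ms;
}

/* How many such random accesses fit into one second. */
static double random_iops(double access_ms) {
    return 1000.0 / access_ms;
}
```

With an 8 ms seek, 7200 RPM, and a 4 KB transfer at 100 MB/s this lands at roughly 12.2 ms per access, i.e. about 80 random IOPS.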

Disk Scheduling Algorithms

1. FCFS (First-Come, First-Served)

Request queue: 98, 183, 37, 122, 14, 124, 65, 67
Head position: 53

Order: 53 → 98 → 183 → 37 → 122 → 14 → 124 → 65 → 67
Total movement: 45+85+146+85+108+110+59+2 = 640 cylinders

2. SSTF (Shortest Seek Time First)

Head: 53
Order: 53 → 65 → 67 → 37 → 14 → 98 → 122 → 124 → 183
Total movement: 12+2+30+23+84+24+2+59 = 236 cylinders

Problem: Starvation (far requests may never be served)
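The two orderings above can be checked mechanically. A small sketch (function names are mine, and the SSTF helper assumes at most 64 pending requests):

```c
#include <limits.h>
#include <stdlib.h>

/* FCFS: serve requests in arrival order, summing head movement. */
static int fcfs_movement(int head, const int *req, int n) {
    int total = 0;
    for (int i = 0; i < n; i++) {
        total += abs(req[i] - head);
        head = req[i];
    }
    return total;
}

/* SSTF: repeatedly serve the pending request nearest the head. */
static int sstf_movement(int head, const int *req, int n) {
    int done[64] = {0};   /* assumes n <= 64 */
    int total = 0;
    for (int served = 0; served < n; served++) {
        int best = -1, best_dist = INT_MAX;
        for (int i = 0; i < n; i++) {
            int d = abs(req[i] - head);
            if (!done[i] && d < best_dist) { best = i; best_dist = d; }
        }
        done[best] = 1;
        total += best_dist;
        head = req[best];
    }
    return total;
}
```

For the queue 98, 183, 37, 122, 14, 124, 65, 67 with the head at 53, these return 640 and 236 cylinders, matching the traces above.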

3. SCAN (Elevator Algorithm)

Head: 53, Direction: Right
Order: 53 → 65 → 67 → 98 → 122 → 124 → 183 → [reverse] → 37 → 14
Total movement: (183-53) + (183-14) = 130 + 169 = 299 cylinders
(strict SCAN sweeps to the last cylinder before reversing — with a disk ending
at cylinder 199 that would be (199-53) + (199-14) = 331; turning at the
farthest request, as counted here, is really LOOK behavior)

4. C-SCAN (Circular SCAN)

Head: 53, Direction: Right
Order: 53 → 65 → 67 → 98 → 122 → 124 → 183 → [jump to 0] → 14 → 37
Total movement: 130 + 183 (jump) + 37 = 350 cylinders
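With the turnaround-at-farthest-request accounting used in these traces, both sweep totals collapse to one-line formulas. A sketch (names are mine):

```c
/* SCAN-style sweep: right to the farthest request, then back left
   to the nearest one (turning at requests, i.e. LOOK accounting). */
static int scan_movement(int head, int max_req, int min_req) {
    return (max_req - head) + (max_req - min_req);
}

/* C-SCAN-style sweep: right to the farthest request, a full jump
   back to cylinder 0, then right again to the last request below
   the starting position. */
static int cscan_movement(int head, int max_req, int last_low_req) {
    return (max_req - head) + max_req + last_low_req;
}
```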

5. LOOK / C-LOOK

Like SCAN, but the head only travels as far as the last request in each direction instead of all the way to the end of the disk.

graph LR
FCFS[FCFS<br/>Fair<br/>Poor performance]
SSTF[SSTF<br/>Best performance<br/>Starvation]
SCAN[SCAN<br/>No starvation<br/>Good performance]
CLOOK[C-LOOK<br/>Better than SCAN<br/>Common in practice]

Solid State Drive (SSD)

SSD Structure

graph TB
SSD[SSD] --> Controller[SSD Controller<br/>FTL firmware]
SSD --> DRAM[DRAM Cache<br/>Mapping table]
SSD --> Flash[NAND Flash Memory]

Flash --> Blocks[Blocks<br/>~256-512 KB]
Blocks --> Pages[Pages<br/>4-16 KB]

Controller --> FTL[Flash Translation Layer<br/>Logical → Physical]
Controller --> GC[Garbage Collection]
Controller --> WL[Wear Leveling]

NAND flash types:

| Type | Bits/Cell | Speed   | Endurance | Cost    | Use Case          |
|------|-----------|---------|-----------|---------|-------------------|
| SLC  | 1         | Fastest | ~100k P/E | Highest | Enterprise        |
| MLC  | 2         | Fast    | ~10k P/E  | High    | Consumer high-end |
| TLC  | 3         | Medium  | ~3k P/E   | Medium  | Consumer          |
| QLC  | 4         | Slow    | ~1k P/E   | Low     | Archive           |

P/E Cycles = Program/Erase cycles

Flash Translation Layer (FTL)

sequenceDiagram
participant OS
participant FTL
participant Flash

OS->>FTL: Write to LBA 100
FTL->>FTL: Check mapping table
FTL->>Flash: Write to Physical Block 532
FTL->>FTL: Update mapping: LBA 100 → PBA 532

Note over FTL: Old data (if exists) marked invalid

FTL responsibilities:

  • Logical-to-Physical mapping
  • Wear leveling
  • Garbage collection
  • Bad block management
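A toy page-level FTL illustrates the remap-on-write behaviour in the diagram. This is a sketch under heavy assumptions (one flat mapping table, a naive bump allocator, no GC or bad-block handling); all names are mine.

```c
#include <stdint.h>

#define NUM_PAGES 1024
#define UNMAPPED  UINT32_MAX

static uint32_t l2p[NUM_PAGES];   /* logical page -> physical page */
static uint8_t  valid[NUM_PAGES]; /* does this physical page hold live data? */
static uint32_t next_free;        /* naive log-structured allocator */

static void ftl_init(void) {
    for (uint32_t i = 0; i < NUM_PAGES; i++) { l2p[i] = UNMAPPED; valid[i] = 0; }
    next_free = 0;
}

/* Out-of-place write: allocate a fresh physical page, remap the LBA,
   and mark the previous copy invalid so GC can reclaim it later. */
static uint32_t ftl_write(uint32_t lba) {
    if (l2p[lba] != UNMAPPED)
        valid[l2p[lba]] = 0;      /* old data becomes garbage */
    uint32_t phys = next_free++;
    valid[phys] = 1;
    l2p[lba] = phys;
    return phys;
}

static uint32_t ftl_read(uint32_t lba) {
    return l2p[lba];              /* UNMAPPED if never written */
}
```

Rewriting the same LBA lands on a new physical page and invalidates the old one, which is exactly the "LBA 100 → PBA 532, old data marked invalid" flow in the sequence diagram above.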

Wear Leveling

Each NAND flash block can endure only a limited number of program/erase cycles.

graph TB
WL[Wear Leveling] --> Static[Static Wear Leveling]
WL --> Dynamic[Dynamic Wear Leveling]

Static --> S1[Rarely-written data<br/>moved to high-wear blocks]
Dynamic --> D1[Frequently-written data<br/>spread across blocks]

Example:

Block 0: 5000 P/E cycles
Block 1: 2000 P/E cycles
Block 2: 100 P/E cycles

FTL action: Write new data to Block 2 (lowest wear)
If Block 2 has static data → move to Block 0, use Block 2 for new writes
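The FTL decision in the example is simply "pick the block with the lowest erase count". A minimal sketch (function name is mine):

```c
/* Dynamic wear leveling: steer the next write to the block with the
   fewest program/erase cycles so wear evens out over time. */
static int least_worn_block(const int *pe_cycles, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (pe_cycles[i] < pe_cycles[best])
            best = i;
    return best;
}
```

With the counts above (5000, 2000, 100) it selects Block 2.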

Garbage Collection

sequenceDiagram
participant OS
participant SSD

Note over SSD: Block has valid & invalid pages

OS->>SSD: Write request
SSD->>SSD: No free pages!

Note over SSD: Trigger GC
SSD->>SSD: 1. Read valid pages
SSD->>SSD: 2. Write to new block
SSD->>SSD: 3. Erase old block
SSD->>SSD: 4. Now have free pages

SSD->>OS: Write complete

Write Amplification:

Write Amplification Factor (WAF) = Data written to flash / Data written by host

Example:
Host writes 1 GB
SSD writes 1 GB (new) + 500 MB (GC moves valid data)
WAF = 1.5 GB / 1 GB = 1.5
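The WAF formula is a one-liner; sketched here with the example's figures (units cancel, so any consistent unit works):

```c
/* Write Amplification Factor: total flash writes (host data plus the
   valid pages GC had to relocate) divided by host writes. */
static double waf(double host_writes, double gc_relocated) {
    return (host_writes + gc_relocated) / host_writes;
}
```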

Minimizing WAF:

  • Over-provisioning (extra capacity for GC)
  • TRIM command (OS tells SSD which blocks are free)
  • Reduce small random writes (batch and coalesce them)

TRIM Command

# Linux - manual TRIM
fstrim -v /

# Enable periodic TRIM
systemctl enable fstrim.timer

# Check if SSD supports TRIM
lsblk -D

sequenceDiagram
participant OS
participant FS as File System
participant SSD

OS->>FS: Delete file
FS->>FS: Mark blocks as free
FS->>SSD: TRIM command (LBAs)
SSD->>SSD: Mark physical blocks as invalid

Note over SSD: No need to preserve data<br/>Better GC performance

HDD vs SSD Comparison

| Feature             | HDD                    | SSD                   |
|---------------------|------------------------|-----------------------|
| **Speed**           |                        |                       |
| Sequential read     | 100-200 MB/s           | 500-7000 MB/s         |
| Sequential write    | 100-200 MB/s           | 500-7000 MB/s         |
| Random IOPS         | 80-160                 | 10k-1M                |
| Latency             | 5-10 ms                | 0.1 ms                |
| **Physical**        |                        |                       |
| Shock resistant     | No (moving parts)      | Yes                   |
| Noise               | Yes                    | No                    |
| Power               | 6-10 W                 | 2-5 W                 |
| Heat                | More                   | Less                  |
| **Cost & Capacity** |                        |                       |
| $/GB                | ~$0.02                 | ~$0.10                |
| Max capacity        | 20 TB                  | 8 TB (consumer)       |
| Lifespan            | 3-5 years              | 5-10 years            |
| **Use case**        | Large storage, archive | Boot drive, databases |

NVMe (Non-Volatile Memory Express)

NVMe vs SATA

graph TB
subgraph SATA
SATA_OS[OS] --> SATA_AHCI[AHCI Driver]
SATA_AHCI --> SATA_Controller[SATA Controller]
SATA_Controller --> SATA_SSD[SSD]

Note1[Single queue<br/>32 commands]
end

subgraph NVMe
NVMe_OS[OS] --> NVMe_Driver[NVMe Driver]
NVMe_Driver --> PCIe[PCIe]
PCIe --> NVMe_SSD[NVMe SSD]

Note2[64k queues<br/>64k commands each]
end
| Feature      | SATA          | NVMe                                         |
|--------------|---------------|----------------------------------------------|
| Interface    | SATA (6 Gbps) | PCIe 3.0 x4 (32 Gbps)                        |
| Max speed    | ~550 MB/s     | ~3500 MB/s (PCIe 3.0), ~7000 MB/s (PCIe 4.0) |
| Queue depth  | 32            | 64k                                          |
| Queues       | 1             | 64k                                          |
| Latency      | Higher        | Lower (less protocol overhead)               |
| CPU overhead | Higher        | Lower (direct PCIe)                          |

NVMe Architecture

graph TB
App[Application] --> Kernel[Kernel]
Kernel --> Driver[NVMe Driver]

Driver --> SQ[Submission Queue<br/>Commands to device]
Driver --> CQ[Completion Queue<br/>Results from device]

SQ --> Device[NVMe Device]
Device --> CQ

Device --> Flash[NAND Flash]
Device --> Controller[NVMe Controller]

Command flow:

  1. Driver writes command to Submission Queue
  2. Driver rings doorbell register (notify device)
  3. Device fetches command via DMA
  4. Device processes command
  5. Device writes completion entry to Completion Queue
  6. Device sends interrupt (or driver polls)
  7. Driver reads Completion Queue

NVMe Command Example

struct nvme_rw_command {
    uint8_t  opcode;      // READ or WRITE
    uint8_t  flags;
    uint16_t command_id;
    uint32_t nsid;        // Namespace ID
    uint64_t slba;        // Starting LBA
    uint16_t length;      // Number of blocks (0-based)
    // ... more fields
    uint64_t prp1;        // Physical Region Page 1
    uint64_t prp2;        // Physical Region Page 2
};

// Submit read command
void nvme_read(uint64_t lba, uint16_t blocks, void* buffer) {
    struct nvme_rw_command cmd = {0};
    cmd.opcode = NVME_CMD_READ;
    cmd.nsid   = 1;
    cmd.slba   = lba;
    cmd.length = blocks - 1;  // 0-based
    cmd.prp1   = virt_to_phys(buffer);

    // Write to submission queue
    submit_queue[sq_tail] = cmd;
    sq_tail = (sq_tail + 1) % queue_size;

    // Ring doorbell
    writel(sq_tail, doorbell_register);
}

NVMe Performance

PCIe 3.0 x4: ~4 GB/s theoretical, ~3.5 GB/s real
PCIe 4.0 x4: ~8 GB/s theoretical, ~7 GB/s real
PCIe 5.0 x4: ~16 GB/s theoretical, ~14 GB/s real

Random 4K IOPS: 500k - 1M
Latency: 10-20 µs

RAID (Redundant Array of Independent Disks)

RAID Levels

RAID 0 - Striping

graph LR
Data[Data: ABCDEFGH] --> Split{Split}
Split --> Disk1[Disk 1<br/>ACEG]
Split --> Disk2[Disk 2<br/>BDFH]
  • Capacity: N × disk_size
  • Performance: N × speed
  • Redundancy: None (any disk fails → data lost)
  • Use case: Performance, temporary data

RAID 1 - Mirroring

graph LR
Data[Data: ABCDEFGH] --> Mirror{Mirror}
Mirror --> Disk1[Disk 1<br/>ABCDEFGH]
Mirror --> Disk2[Disk 2<br/>ABCDEFGH]
  • Capacity: disk_size
  • Performance: Read: 2×, Write: 1×
  • Redundancy: 1 disk failure tolerated
  • Use case: Critical data

RAID 5 - Striping with Parity

graph LR
Data[Data: ABCDEF] --> Stripe{Stripe + Parity}
Stripe --> Disk1[Disk 1<br/>A, B, Parity_CD]
Stripe --> Disk2[Disk 2<br/>C, Parity_AB, E]
Stripe --> Disk3[Disk 3<br/>Parity_EF, D, F]
  • Capacity: (N-1) × disk_size
  • Performance: Read: fast, Write: slower (parity calc)
  • Redundancy: 1 disk failure
  • Use case: General purpose

Parity calculation:

A = 10110101
B = 11001010
Parity = A XOR B = 01111111

If A is lost:
A = B XOR Parity = 11001010 XOR 01111111 = 10110101
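The XOR arithmetic above, as code (0xB5 = 10110101, 0xCA = 11001010):

```c
#include <stdint.h>

/* RAID-5 parity over two data bytes; with more disks, XOR every
   data byte in the stripe. */
static uint8_t parity(uint8_t a, uint8_t b) {
    return a ^ b;
}

/* Rebuild a lost byte: XOR the surviving data with the parity.
   Works because a ^ b ^ b == a. */
static uint8_t rebuild(uint8_t survivor, uint8_t par) {
    return survivor ^ par;
}
```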

RAID 6 - Double Parity

graph LR
Data[Data: ABCD] --> DP{Double Parity}
DP --> Disk1[Disk 1<br/>A, P1, Q1]
DP --> Disk2[Disk 2<br/>B, P2, Q2]
DP --> Disk3[Disk 3<br/>C, P3, Q3]
DP --> Disk4[Disk 4<br/>D, P4, Q4]
  • Capacity: (N-2) × disk_size
  • Redundancy: 2 disk failures
  • Use case: High reliability

RAID 10 (1+0) - Mirrored Stripes

graph TB
Data[Data: ABCDEFGH] --> Stripe{Stripe}
Stripe --> S1[ACEG]
Stripe --> S2[BDFH]

S1 --> Mirror1{Mirror}
S2 --> Mirror2{Mirror}

Mirror1 --> D1[Disk 1<br/>ACEG]
Mirror1 --> D2[Disk 2<br/>ACEG]
Mirror2 --> D3[Disk 3<br/>BDFH]
Mirror2 --> D4[Disk 4<br/>BDFH]
  • Capacity: N/2 × disk_size
  • Performance: Excellent
  • Redundancy: 1 disk per mirror
  • Use case: High performance + reliability

RAID Comparison

| RAID | Capacity | Performance | Redundancy   | Min Disks |
|------|----------|-------------|--------------|-----------|
| 0    | 100%     | Excellent   | None         | 2         |
| 1    | 50%      | Good read   | 1 disk       | 2         |
| 5    | (N-1)/N  | Good        | 1 disk       | 3         |
| 6    | (N-2)/N  | Good        | 2 disks      | 4         |
| 10   | 50%      | Excellent   | 1 per mirror | 4         |
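The capacity column generalizes to a small helper. A sketch (it trusts the caller to respect each level's minimum disk count):

```c
/* Usable capacity for n equal disks of disk_size each. */
static double raid_capacity(int level, int n, double disk_size) {
    switch (level) {
    case 0:  return n * disk_size;        /* striping, no redundancy    */
    case 1:  return disk_size;            /* every disk is a mirror     */
    case 5:  return (n - 1) * disk_size;  /* one disk's worth of parity */
    case 6:  return (n - 2) * disk_size;  /* two disks' worth of parity */
    case 10: return (n / 2) * disk_size;  /* half the disks are mirrors */
    default: return 0.0;
    }
}
```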

Storage Interfaces

graph TB
Interfaces[Storage Interfaces] --> SATA[SATA<br/>6 Gbps<br/>~550 MB/s]
Interfaces --> SAS[SAS<br/>12 Gbps<br/>~1200 MB/s]
Interfaces --> NVMe[NVMe<br/>PCIe 4.0 x4<br/>~7000 MB/s]
Interfaces --> USB[USB<br/>USB 3.2: 20 Gbps]

SATA --> Consumer[Consumer HDDs/SSDs]
SAS --> Enterprise[Enterprise storage]
NVMe --> HighPerf[High-performance SSDs]
USB --> External[External drives]

Storage Performance Optimization

1. Alignment

# Check partition alignment
sudo parted /dev/sda align-check optimal 1

# Align to 1MB boundary (optimal for SSDs)
sudo parted /dev/sda mkpart primary 1MiB 100%

Result of misalignment:

Partition starts at 512B
SSD page size: 4KB

Write 4KB → spans 2 pages → 2 operations instead of 1
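Whether a write straddles a page boundary follows directly from its start offset. A sketch of the count (function name is mine):

```c
/* Number of flash pages an I/O touches, given byte offset, length,
   and page size. An aligned 4 KB write on 4 KB pages touches one
   page; the same write starting at offset 512 straddles two. */
static int pages_touched(long offset, long length, long page_size) {
    long first = offset / page_size;
    long last  = (offset + length - 1) / page_size;
    return (int)(last - first + 1);
}
```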

2. File System Selection

| FS    | Best for      | Features           |
|-------|---------------|--------------------|
| ext4  | General Linux | Mature, journaling |
| XFS   | Large files   | Good performance   |
| Btrfs | Snapshots     | CoW, compression   |
| F2FS  | SSDs          | Flash-friendly     |
| NTFS  | Windows       | Journaling         |
| APFS  | macOS         | SSD-optimized      |

3. I/O Scheduler

# Check current scheduler
cat /sys/block/sda/queue/scheduler

# Set scheduler
echo mq-deadline > /sys/block/sda/queue/scheduler

# Schedulers:
# - none: No scheduling (for NVMe)
# - mq-deadline: Good for SSD
# - bfq: Fair queuing (desktop)
# - kyber: Low latency

4. Read-Ahead

# Check read-ahead
sudo blockdev --getra /dev/sda

# Set read-ahead (in 512-byte sectors)
sudo blockdev --setra 256 /dev/sda # 128 KB

5. Queue Depth

# Check queue depth
cat /sys/block/nvme0n1/queue/nr_requests

# Increase for high-IOPS workloads
echo 1024 > /sys/block/nvme0n1/queue/nr_requests

Advanced Topics

1. Over-Provisioning

User-visible capacity: 240 GB
Physical capacity: 256 GB
Over-provisioning: 16 GB (6.25%)

Purpose:
- Reserve space for wear leveling
- Better GC performance
- Maintain performance over time

2. DRAM Cache

graph LR
Host[Host] --> DRAM[DRAM Cache<br/>Mapping table<br/>Write buffer]
DRAM --> Flash[NAND Flash]

DRAM -.Power loss.-> Lost[Data lost<br/>without capacitors]

DRAM-less SSDs:

  • Cheaper
  • Slower (mapping in HMB - Host Memory Buffer)
  • Less reliable

3. SLC Cache

graph TB
Write[Write Request] --> SLC[SLC Cache<br/>Fast TLC treated as SLC]

SLC -->|Cache full| Fold[Folding to TLC]
Fold --> TLC[TLC Storage<br/>Slower but more capacity]

Performance pattern:

Fast writes (SLC cache): 500 MB/s
Cache full, folding: 100 MB/s (slow!)

4. Write Coalescing

// Instead of:
write(fd, buffer, 4096); // 4K write
write(fd, buffer, 4096); // 4K write
write(fd, buffer, 4096); // 4K write
write(fd, buffer, 4096); // 4K write

// Coalesce:
write(fd, large_buffer, 16384); // 16K write (1 flash page)

Best Practices

  1. SSD:

    • Enable TRIM
    • Don't defragment
    • Disable hibernation/swap on consumer SSDs
    • Keep 10-20% free space
  2. HDD:

    • Regular defragmentation (Windows)
    • Avoid excessive head movement
    • Use for cold storage
  3. RAID:

    • Monitor disk health (SMART)
    • Replace failed disks immediately
    • Use hot spares
  4. Performance:

    • Align partitions
    • Choose right file system
    • Use appropriate I/O scheduler
    • Monitor with iostat, iotop
  5. Reliability:

    • Regular backups (3-2-1 rule)
    • Monitor disk temperature
    • Check SMART attributes

Monitoring Tools

# Disk usage
df -h
lsblk

# I/O statistics
iostat -x 1

# Disk activity
iotop

# SMART status
smartctl -a /dev/sda

# NVMe info
nvme list
nvme smart-log /dev/nvme0n1

# Benchmark
fio --name=test --rw=randread --bs=4k --size=1G

Related Topics

  • I/O Systems: DMA, interrupts
  • Memory Hierarchy: Caching strategies
  • File Systems: Storage management
  • Performance: I/O optimization
  • Reliability: RAID, backups