RAID1 Hot-Swap Procedure & Maintenance

Array Info

Device: /dev/md0
Type: RAID1 (mirror)
Size: 8TB usable (2x 8TB drives, mirrored)
Mount: /mnt/storage

Daily/Weekly Health Checks

Quick Status Check

cat /proc/mdstat

[UU] = Both drives healthy
[U_] or [_U] = One drive failed/missing
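
The status check above can be scripted. The following is a minimal sketch, not part of the original procedure: `degraded_arrays` is a hypothetical helper name, and it assumes the usual `/proc/mdstat` layout in which each array's status field looks like `[UU]` or `[U_]`.

```shell
# Sketch: print the name of every md array whose status field contains "_"
# (a missing or failed member). Reads mdstat-formatted text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # remember the current array name
        match($0, /\[[U_]+\]/) {             # status field such as [UU] or [U_]
            if (index(substr($0, RSTART, RLENGTH), "_")) print name
        }
    '
}

# Usage: degraded_arrays < /proc/mdstat
```

No output means every array reported all members up.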

Detailed Array Info

sudo mdadm --detail /dev/md0

Check Drive SMART Health

# Install smartmontools if needed
sudo apt install smartmontools

# Check each drive
sudo smartctl -H /dev/sda
sudo smartctl -H /dev/sdg

# Full SMART report
sudo smartctl -a /dev/sda
sudo smartctl -a /dev/sdg

Check for Drive Errors

# Kernel messages for disk errors
sudo dmesg | grep -iE 'error|fail|sd[a-z]'

# Check system logs
sudo journalctl -u mdmonitor --since "24 hours ago"

Monthly Maintenance Plan

1. SMART Self-Test (run monthly)

# Start short self-test (~2 min)
sudo smartctl -t short /dev/sda
sudo smartctl -t short /dev/sdg

# Check results after test completes
sudo smartctl -l selftest /dev/sda
sudo smartctl -l selftest /dev/sdg

2. Array Consistency Check (run monthly)

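# Trigger a check (non-destructive, runs in background)
# tee runs as root; a plain "sudo echo ... >" redirect would be opened by the unprivileged shell and fail
echo check | sudo tee /sys/block/md0/md/sync_action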

# Monitor progress
cat /proc/mdstat

# Check result when done
sudo cat /sys/block/md0/md/mismatch_cnt
# Should be 0 (or very low)
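
For an unattended monthly run, the result check could be wrapped as below. This is a sketch under assumptions: `report_mismatch` is a hypothetical helper, and the optional path argument exists only so the function can be tried against a test file.

```shell
# Sketch: warn when the mismatch counter left by the last check is non-zero.
# Defaults to the md0 counter used in this document.
report_mismatch() {
    file=${1:-/sys/block/md0/md/mismatch_cnt}
    count=$(cat "$file") || return 2         # unreadable counter: give up
    if [ "$count" -eq 0 ]; then
        echo "OK: mismatch_cnt is 0"
    else
        echo "WARNING: mismatch_cnt is $count"
        return 1
    fi
}

# Usage: report_mismatch    (run after /proc/mdstat shows the check has finished)
```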

3. Review SMART Attributes to Watch

sudo smartctl -A /dev/sda | grep -E 'Reallocated|Pending|Uncorrectable|Power_On|Temperature'

Reallocated_Sector_Ct - bad sectors remapped (watch for increases)
Current_Pending_Sector - sectors waiting to be remapped
Offline_Uncorrectable - sectors that couldn't be read
Temperature_Celsius - keep below 50°C
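
To flag only the attributes that need attention, the table can be filtered on the raw value. This is a rough sketch: `smart_warnings` is a hypothetical helper, and it assumes the ATA attribute table printed by `smartctl -A`, where the raw value is the tenth column. NVMe and SAS drives report in a different format.

```shell
# Sketch: print the watched sector attributes whose raw value is non-zero.
# Expects `smartctl -A` table output on stdin.
smart_warnings() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
        print $2 "=" $10
    }'
}

# Usage: sudo smartctl -A /dev/sda | smart_warnings
```

An empty result means all three counters are zero on that drive.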

Hot-Swap Procedure (when a drive fails)

1. Identify the failed drive

cat /proc/mdstat
# Look for [U_] or [_U] - underscore shows failed position

sudo mdadm --detail /dev/md0
# Shows "faulty" or "removed" next to bad drive
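
To pull out just the device name, the detail output can be filtered. This is a sketch: `faulty_members` is a hypothetical helper, and it assumes the usual device table in which a failed member's line contains `faulty` and ends with its device path. Members already shown as `removed` have no path and are not listed.

```shell
# Sketch: list member devices that mdadm reports as faulty.
# Expects `mdadm --detail` output on stdin.
faulty_members() {
    awk '/faulty/ && $NF ~ /^\/dev\// { print $NF }'
}

# Usage: sudo mdadm --detail /dev/md0 | faulty_members
```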

2. Remove the failed drive from array

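# Mark the drive as failed first if mdadm has not already done so
sudo mdadm --fail /dev/md0 /dev/sdX1
sudo mdadm --remove /dev/md0 /dev/sdX1
# Replace sdX1 with the actual failed device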

3. Physically swap the drive

Power down first unless the controller and drive bay support hot-swap
Remove the failed drive
Insert the new drive (must be at least as large as the old one: 8TB or more)

4. Partition the new drive

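# Option A: Copy partition table from good drive
# Careful: the target (new) drive comes first; reversing the arguments overwrites the good drive's table
sudo sgdisk -R /dev/sdNEW /dev/sdGOOD
# Give the new drive its own random GUIDs
sudo sgdisk -G /dev/sdNEW

# Option B: Manual partition
# Sector numbers are specific to these drives; confirm against the good drive with: sudo sgdisk -p /dev/sdGOOD
sudo sgdisk -n 1:2048:15628052479 -t 1:fd00 /dev/sdNEW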

5. Add new drive to array

sudo mdadm --add /dev/md0 /dev/sdNEW1

6. Monitor rebuild

watch cat /proc/mdstat
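
For logging rather than watching, the progress line can be reduced to a one-line summary. This is a sketch: `rebuild_progress` is a hypothetical helper, and it assumes the kernel's usual progress line of the form `recovery = 0.1% (...) finish=651.2min speed=...`.

```shell
# Sketch: summarise rebuild progress from mdstat-formatted text on stdin.
rebuild_progress() {
    awk '/(recovery|resync) *=/ && /%/ {
        pct = ""; eta = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /%$/) pct = $i                      # e.g. 0.1%
            if ($i ~ /^finish=/) eta = substr($i, 8)     # text after "finish="
        }
        print pct " done, about " eta " remaining"
    }'
}

# Usage: rebuild_progress < /proc/mdstat
```

No output means no recovery or resync is currently running.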

Automated Monitoring Setup

Enable Email Alerts

# Edit mdadm config
sudo nano /etc/mdadm/mdadm.conf

# Set MAILADDR to your email
MAILADDR your@email.com

# Restart monitor
sudo systemctl restart mdmonitor

Set Up SMART Monitoring Daemon

# Enable smartd
sudo systemctl enable smartd
sudo systemctl start smartd

# Edit config for email alerts
sudo nano /etc/smartd.conf
# Add line for each drive:
/dev/sda -a -m your@email.com
/dev/sdg -a -m your@email.com

Important Notes

Rebuild time: ~11 hours for 8TB at ~200MB/s
During rebuild: Array accessible but slower
Backup reminder: RAID is NOT a backup - keep offsite copies of critical data
Drive lifespan: Plan to replace drives every 3-5 years proactively
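
The rebuild-time figure follows from dividing capacity by sustained throughput. Here is a small sketch of that arithmetic, using a hypothetical helper `rebuild_estimate` and assuming decimal units (1 MB = 1,000,000 bytes) and a constant speed, which real rebuilds rarely sustain.

```shell
# Sketch: estimate rebuild time from drive size (bytes) and speed (MB/s).
rebuild_estimate() {
    size_bytes=$1
    speed_mb=$2
    seconds=$(( size_bytes / (speed_mb * 1000000) ))
    echo "$(( seconds / 3600 ))h $(( seconds % 3600 / 60 ))m"
}

rebuild_estimate 8000000000000 200   # prints "11h 6m"
```

Halving the speed doubles the estimate, so a busy array can take considerably longer.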
