@R4wm
Created December 30, 2025 16:06
RAID1 Hot-Swap Procedure & Maintenance

Array Info

  • Device: /dev/md0
  • Type: RAID1 (mirror)
  • Size: 8TB usable (2x 8TB drives, mirrored)
  • Mount: /mnt/storage

Daily/Weekly Health Checks

Quick Status Check

cat /proc/mdstat
  • [UU] = Both drives healthy
  • [U_] or [_U] = One drive failed/missing
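The status check above can be scripted for cron or a login banner. The following is a minimal sketch, not a definitive tool: the function name `degraded_arrays` is hypothetical, and it assumes the usual /proc/mdstat layout where the array name starts a line and the `[UU]`-style field appears on a following line.

```shell
# List md arrays whose member-status field contains an underscore,
# i.e. at least one member is failed or missing.
# Usage: degraded_arrays [mdstat-file]   (defaults to /proc/mdstat)
degraded_arrays() {
    mdstat="${1:-/proc/mdstat}"
    awk '
        /^md[0-9]+ :/            { name = $1 }     # remember the current array
        /\[[U_]*_[U_]*\]/        { print name }    # status field with a gap
    ' "$mdstat"
}

# Example (prints nothing when every array is healthy):
#   degraded_arrays
```

An empty result means no array is degraded; any printed name (for example `md0`) deserves a closer look with `mdadm --detail`.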

Detailed Array Info

sudo mdadm --detail /dev/md0

Check Drive SMART Health

# Install smartmontools if needed
sudo apt install smartmontools

# Check each drive
sudo smartctl -H /dev/sda
sudo smartctl -H /dev/sdg

# Full SMART report
sudo smartctl -a /dev/sda
sudo smartctl -a /dev/sdg

Check for Drive Errors

# Kernel messages for disk errors
sudo dmesg | grep -iE 'error|fail|sd[a-z]'

# Check system logs
sudo journalctl -u mdmonitor --since "24 hours ago"

Monthly Maintenance Plan

1. SMART Self-Test (run monthly)

# Start short self-test (~2 min)
sudo smartctl -t short /dev/sda
sudo smartctl -t short /dev/sdg

# Check results after test completes
sudo smartctl -l selftest /dev/sda
sudo smartctl -l selftest /dev/sdg

2. Array Consistency Check (run monthly)

# Trigger a check (non-destructive, runs in background)
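# Note: "sudo echo check > file" fails because the redirect runs as your own user, so pipe through tee
echo check | sudo tee /sys/block/md0/md/sync_action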

# Monitor progress
cat /proc/mdstat

# Check result when done
sudo cat /sys/block/md0/md/mismatch_cnt
# Should be 0 (or very low)
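Since a check on 8TB drives runs for hours, it can help to wait for it in a script rather than polling by hand. A rough sketch under stated assumptions: `wait_for_check` is a hypothetical helper name, and it relies on `sync_action` reading `idle` once no check, resync, or recovery is running.

```shell
# Block until the array is idle, then print the mismatch count.
# Usage: wait_for_check [md-sysfs-dir] [poll-seconds]
wait_for_check() {
    md_dir="${1:-/sys/block/md0/md}"
    poll="${2:-60}"
    # sync_action reads "idle" when no check/resync/recovery is in progress
    while [ "$(cat "$md_dir/sync_action")" != "idle" ]; do
        sleep "$poll"
    done
    cat "$md_dir/mismatch_cnt"
}

# Example: start a check, then wait and report the count
#   echo check | sudo tee /sys/block/md0/md/sync_action
#   wait_for_check
```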

3. Review SMART Attributes to Watch

sudo smartctl -A /dev/sda | grep -E 'Reallocated|Pending|Uncorrectable|Power_On|Temperature'
  • Reallocated_Sector_Ct - bad sectors remapped (watch for increases)
  • Current_Pending_Sector - sectors waiting to be remapped
  • Offline_Uncorrectable - sectors that couldn't be read
  • Temperature_Celsius - keep below 50°C
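The three sector-health attributes above are normally zero on a healthy drive, so a filter that prints only non-zero raw values makes the monthly review quicker. This is a sketch only: `smart_warnings` is a hypothetical name, and it assumes the ATA-style attribute table printed by `smartctl -A`, where the raw value is the tenth column (NVMe and SAS drives report differently).

```shell
# Read `smartctl -A` output on stdin and print NAME=RAW for any of the
# three sector-health attributes whose raw value is greater than zero.
smart_warnings() {
    awk '
        $2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 {
            print $2 "=" $10
        }
    '
}

# Example:
#   sudo smartctl -A /dev/sda | smart_warnings
```

No output means all three counters are zero for that drive.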

Hot-Swap Procedure (when a drive fails)

1. Identify the failed drive

cat /proc/mdstat
# Look for [U_] or [_U] - underscore shows failed position

sudo mdadm --detail /dev/md0
# Shows "faulty" or "removed" next to bad drive
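To avoid misreading the device table when under pressure, the faulty member can be pulled out of the `mdadm --detail` output mechanically. A minimal sketch, assuming the usual table layout where the state and device path end each member line; `faulty_members` is a hypothetical name.

```shell
# Read `mdadm --detail` output on stdin and print the device path of
# every member whose state includes "faulty".
faulty_members() {
    awk '/faulty/ && $NF ~ /^\/dev\// { print $NF }'
}

# Example:
#   sudo mdadm --detail /dev/md0 | faulty_members
```

A member that has already dropped out entirely shows as "removed" with no device path, so it will not appear here; in that case compare the remaining member against the known drive pair.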

2. Remove the failed drive from array

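# If mdadm does not already list the member as faulty, mark it failed first
sudo mdadm --fail /dev/md0 /dev/sdX1
sudo mdadm --remove /dev/md0 /dev/sdX1
# Replace sdX1 with the actual failed device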

3. Physically swap the drive

  • Hot-swap only if the controller and drive bay support it; otherwise power down first
  • Remove failed drive
  • Insert new drive (must be 8TB or larger)

4. Partition the new drive

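# Option A: Copy partition table from good drive
# Careful: the argument to -R is the DESTINATION; the source drive comes last
sudo sgdisk -R /dev/sdNEW /dev/sdGOOD
# Give the copy fresh random GUIDs so they do not clash with the source
sudo sgdisk -G /dev/sdNEW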

# Option B: Manual partition (must be at least as large as the good drive's partition)
sudo sgdisk -n 1:2048:15628052479 -t 1:fd00 /dev/sdNEW

5. Add new drive to array

sudo mdadm --add /dev/md0 /dev/sdNEW1

6. Monitor rebuild

watch cat /proc/mdstat
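For logging or notifications during the rebuild, the percentage can be extracted from /proc/mdstat instead of watching it. A small sketch, assuming the progress line has the usual `recovery = 8.5% (...)` form; `rebuild_progress` is a hypothetical name.

```shell
# Print the rebuild/resync percentage from mdstat, or nothing if no
# rebuild is running.
# Usage: rebuild_progress [mdstat-file]   (defaults to /proc/mdstat)
rebuild_progress() {
    mdstat="${1:-/proc/mdstat}"
    # -o keeps only the matched text, e.g. "recovery =  8.5%";
    # awk then prints its last field, the percentage
    grep -oE '(recovery|resync) *= *[0-9.]+%' "$mdstat" | awk '{ print $NF }'
}

# Example:
#   rebuild_progress
```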

Automated Monitoring Setup

Enable Email Alerts

# Edit mdadm config
sudo nano /etc/mdadm/mdadm.conf

# In that file, set MAILADDR to your email (delivery needs a working local mail transfer agent)
MAILADDR your@email.com

# Restart monitor
sudo systemctl restart mdmonitor

Setup SMART Monitoring Daemon

# Enable smartd
sudo systemctl enable smartd
sudo systemctl start smartd

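# Edit config for email alerts
sudo nano /etc/smartd.conf
# Add a line for each drive, above any DEVICESCAN line (or comment DEVICESCAN out):
/dev/sda -a -m your@email.com
/dev/sdg -a -m your@email.com

# Restart so smartd picks up the new config
sudo systemctl restart smartd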

Important Notes

  • Rebuild time: ~11 hours for 8TB at ~200MB/s
  • During rebuild: Array accessible but slower
  • Backup reminder: RAID is NOT a backup - keep offsite copies of critical data
  • Drive lifespan: Plan to replace drives every 3-5 years proactively
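The rebuild-time figure in the notes above is simply capacity divided by sustained throughput. The arithmetic can be sketched as follows, assuming decimal units (1 TB = 1,000,000 MB) and a constant speed; `rebuild_hours` is a hypothetical helper, and real rebuilds slow down under load and towards the inner tracks of the disk.

```shell
# Estimate rebuild time in hours.
# Usage: rebuild_hours <capacity-in-TB> <speed-in-MB/s>
rebuild_hours() {
    tb="$1"
    mbps="$2"
    # Work in tenths of an hour so plain integer arithmetic is enough:
    # seconds = TB * 1,000,000 / MBps; hours = seconds / 3600
    tenths=$(( tb * 1000000 * 10 / (mbps * 3600) ))
    echo "$(( tenths / 10 )).$(( tenths % 10 ))"
}

# Example:
#   rebuild_hours 8 200    # prints 11.1
```

At half the throughput (100MB/s) the same 8TB rebuild takes roughly twice as long, which is worth keeping in mind if the array stays in use during recovery.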