Note: this is mainly meant for Root on ZFS on desktop or server systems, where latency is important.
Goal: keep the machine responsive under heavy writes (especially with compression enabled) by letting ZFS buffer more in RAM, limiting CPU spent in the write pipeline, and sending large I/Os to the SSD.
Note: Zstd compression, blake3 hashing, and dnodesize=auto require a dedicated /boot partition if you boot via e.g. GRUB, since its ZFS support cannot read pools that use these features.
- Set "dirty" buffering to 1-2 GiB, so short bursts of writes don't immediately throttle.
`zfs_dirty_data_max=1073741824` (1 GiB) or `zfs_dirty_data_max=2147483648` (2 GiB)
- Delay throttling until 90% of the dirty buffer limit is reached.
zfs_delay_min_dirty_percent=90
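As a worked example of how these two values interact (assuming the 1 GiB setting above), throttling only starts once outstanding dirty data crosses 90% of zfs_dirty_data_max:

```sh
# 90% of 1 GiB - bursts below this amount of dirty data are not delayed
echo $(( 1073741824 * 90 / 100 ))   # 966367641 bytes, roughly 0.9 GiB
```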
- Limit ZIO write taskq parallelism to keep the CPU responsive under compression.
- Rule of thumb: `zio_taskq_write_tpq = floor(physical_cores * 0.5)`
  - If the machine is basically "just storage", you can push this higher (roughly ~75% of CPU threads), but only if there is no other heavy interactive load to serve, for example a web UI.
  - Note: long writes (above 0.9 x zfs_dirty_data_max) will slow down, as there's less CPU power available.
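A small sketch of the rule of thumb; it assumes `lscpu` is available and counts physical cores as unique core/socket pairs:

```sh
# halve the physical core count - e.g. prints 3 on a 6-core machine
physical_cores=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
echo "zio_taskq_write_tpq=$(( physical_cores / 2 ))"
```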
- Increase the ZFS "block size" (recordsize) to 512 KiB on all datasets (helps large files / sequential I/O) and use zstd at a high level: the CPU can crunch through it in the background, and decompression is still blazing fast on modern CPUs. A higher compression ratio reduces the bytes written to / read from disk, often increasing overall throughput.
`recordsize=512K` and `compression=zstd-16`
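A minimal sketch of applying these properties; the dataset name `rpool/data` is a placeholder for whichever dataset (or the pool root, so children inherit) you want to tune:

```sh
# placeholder dataset name - adjust to your layout (see: zfs list)
zfs set recordsize=512K rpool/data
zfs set compression=zstd-16 rpool/data
```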
- Make sure the 512 KiB blocks can be issued to the SSD "in one go" (avoid premature splitting/aggregation limits).
- Check the block layer's advertised/effective request limits:
`/sys/block/<dev>/queue/max_hw_sectors_kb` (hardware limit)
- Set ZFS aggregation for SSDs to the hardware limit (on NVMe) and to 512 KiB (on SATA).
- Example for SATA:
`zfs_vdev_aggregation_limit_non_rotating=524288`
- Check that the Linux kernel actually detects your SSD as "non-rotating" (otherwise the setting above will have no effect):
`cat /sys/block/sda/queue/rotational` should read 0 for SSDs - otherwise override this.
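A quick sketch of both checks; the device names (`nvme0n1`, `sda`) are placeholders for your actual disks, and the override must be run as root and does not persist across reboots:

```sh
# hardware request limit, in KiB and converted to bytes for the ZFS parameter
cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
echo $(( $(cat /sys/block/nvme0n1/queue/max_hw_sectors_kb) * 1024 ))

# should print 0 for an SSD; override it if the device is mis-detected
cat /sys/block/sda/queue/rotational
echo 0 > /sys/block/sda/queue/rotational
```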
- Set a minimum ARC size so ZFS doesn't shed its cache too aggressively under memory pressure; the system will start swapping instead.
- Rule of thumb: `zfs_arc_min ~= 20% of MemTotal`
  - Example for a 16 GiB machine: 20% ~= 3.2 GiB
    `zfs_arc_min=3435973836`
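A small sketch for deriving that value from /proc/meminfo (assuming the usual MemTotal line in KiB):

```sh
# 20% of MemTotal, in bytes - on a 16 GiB machine this prints 3435973836
mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo $(( mem_kib * 1024 / 5 ))
```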
- Additional recommendations:
  - Lower `redundant_metadata` to at least `some` to reduce write overhead for metadata (use backups instead ;) )
  - Make sure `atime` is `off`
  - Set `checksum` to `blake3` (slower, but increases the safety margin for detecting broken blocks via checksum)
  - Set `dnodesize` to `auto`
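A sketch of applying the recommendations above; `rpool/data` is again a placeholder dataset (or use the pool root so children inherit):

```sh
# placeholder dataset name - adjust to your layout
zfs set redundant_metadata=some rpool/data
zfs set atime=off rpool/data
zfs set checksum=blake3 rpool/data
zfs set dnodesize=auto rpool/data
```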
  - `/var/log`: journal data is already compressed by systemd-journald, and the rest is typically low-value to recompress.
    - On the default CachyOS ZFS layout, `/var/log` is already its own dataset; set `compression=off` there.
  - `/var/cache`: package caches (e.g. pacman packages) are already compressed.
    - On the default CachyOS ZFS layout, `/var/cache` is already its own dataset; set `compression=off` there.
  - `/tmp`: consider creating a dedicated dataset mounted at `/tmp` with compression disabled (usually not worth the CPU cycles)
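A sketch of those three steps; the dataset names below are placeholders, so check your actual layout with `zfs list` first:

```sh
# placeholder dataset names - adjust to your layout
zfs set compression=off rpool/var/log
zfs set compression=off rpool/var/cache
# dedicated /tmp dataset with compression disabled
zfs create -o mountpoint=/tmp -o compression=off rpool/tmp
```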
Deduplication has improved massively with fast dedup (see https://klarasystems.com/articles/introducing-openzfs-fast-dedup/). If you have datasets where you might store duplicate blocks, consider activating it for those datasets - but only if you're tight on storage space, as it will reduce performance and use more memory to hold the deduplication tables.
- Set `dedup` to `blake3`
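For example (the dataset name is a placeholder; enable this only on datasets actually likely to hold duplicate blocks):

```sh
# placeholder dataset name
zfs set dedup=blake3 rpool/shared
```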
Changing dataset properties like compression, dedup, checksum, copies, etc. only affects new writes. To rewrite existing file data with the new settings, use zfs rewrite on the dataset's mountpoint (or a specific directory):
# rewrite everything below this directory (recommended: use -x to not cross into other datasets)
zfs rewrite -r -x /path/to/dataset/mountpoint
# optional: physical rewrite preserves logical birth times (requires feature@physical_rewrite)
zfs rewrite -P -r -x /path/to/dataset/mountpoint
Note: zfs rewrite does not apply changes which require a different logical block size (e.g. recordsize).
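If you want to use `-P`, you can first check whether your pool has the feature mentioned above enabled (the pool name is a placeholder):

```sh
# prints enabled/active if the pool supports physical rewrite
zpool get feature@physical_rewrite rpool
```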
- Disabling cache flushes can improve performance, but is ONLY recommended if:
- Your storage has real power-loss protection (PLP) / battery-backed write cache, OR
- You can accept excessive rollback of writes on a crash (this is not meant for mission-critical use / database servers).
`zfs_nocacheflush=1` and `zil_nocacheflush=1`
`sync=disabled` does a similar thing, as it allows writes that the application asks "to be written to storage" to be buffered in memory only. So if the system is battery-buffered and not a database server, this is probably acceptable (use your own judgement!)
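A sketch, with the dataset name as a placeholder; as noted above, use your own judgement, since in-flight writes are lost on a crash or power loss:

```sh
# placeholder dataset name
zfs set sync=disabled rpool/data
```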
echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_max
echo 90 > /sys/module/zfs/parameters/zfs_delay_min_dirty_percent
echo 3 > /sys/module/zfs/parameters/zio_taskq_write_tpq # 6 real core system
echo 524288 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit_non_rotating # on SATA (otherwise use your NVMe hardware limit)
echo 3435973836 > /sys/module/zfs/parameters/zfs_arc_min # 16 GB system
echo 1 > /sys/module/zfs/parameters/zfs_nocacheflush # use your own judgement here!
echo 1 > /sys/module/zfs/parameters/zil_nocacheflush # use your own judgement here!

# create /etc/modprobe.d/zfs-tuning.conf and add for example
options zfs zfs_dirty_data_max=1073741824
options zfs zfs_delay_min_dirty_percent=90
# 6 real core system
options zfs zio_taskq_write_tpq=3
# on SATA (otherwise use your NVMe hardware limit)
options zfs zfs_vdev_aggregation_limit_non_rotating=524288
# 16 GiB system
options zfs zfs_arc_min=3435973836
# use your own judgement here!
options zfs zfs_nocacheflush=1
options zfs zil_nocacheflush=1
# rebuild the initramfs if you use one, so the options apply early:
mkinitcpio -P
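After a reboot you can verify that the persistent options actually took effect; a minimal check covering a subset of the parameters set above:

```sh
grep -H . /sys/module/zfs/parameters/zfs_dirty_data_max \
          /sys/module/zfs/parameters/zfs_delay_min_dirty_percent \
          /sys/module/zfs/parameters/zfs_arc_min
```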