Note: this is mainly meant for Root on ZFS on desktop or server systems, where latency is important.
Goal: keep the machine responsive under heavy writes (especially with compression enabled) by letting ZFS buffer more in RAM, limiting CPU spent in the write pipeline, and sending large I/Os to the SSD.
Note: Zstd compression, blake3 hashing, and dnodesize=auto require a dedicated /boot partition if you boot via e.g. GRUB, since its ZFS support cannot read pools that use these features.
- Set "dirty" buffering to 1-2 GiB, so short bursts of writes don't immediately throttle.
`zfs_dirty_data_max=1073741824` (1 GiB) or `zfs_dirty_data_max=2147483648` (2 GiB)
- Delay throttling until 90% of the dirty buffer limit is reached.
zfs_delay_min_dirty_percent=90
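As a worked example of how these two values interact (assuming the 1 GiB setting above), throttling only starts once outstanding dirty data crosses 90% of zfs_dirty_data_max:

```sh
# 90% of 1 GiB - bursts below this amount of dirty data are not delayed
echo $(( 1073741824 * 90 / 100 ))   # 966367641 bytes, roughly 0.9 GiB
```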
- Limit ZIO write taskq parallelism to keep the CPU responsive under compression.
- Rule of thumb: `zio_taskq_write_tpq = floor(physical_cores * 0.5)`
  - If the machine is basically "just storage", you can push this higher (roughly ~75% of CPU threads), but only if there is no other heavy interactive load to serve, for example a web UI.
  - Note: long writes (above 0.9 x zfs_dirty_data_max) will slow down, as there's less CPU power available.
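A small sketch of the rule of thumb; it assumes `lscpu` is available and counts physical cores as unique core/socket pairs:

```sh
# halve the physical core count - e.g. prints 3 on a 6-core machine
physical_cores=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
echo "zio_taskq_write_tpq=$(( physical_cores / 2 ))"
```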
- Increase the ZFS "block size" (recordsize) to 512 KiB on all datasets (helps large files / sequential I/O) and use zstd at a high level: the CPU can crunch through it in the background, and decompression is still blazing fast on modern CPUs. A higher compression ratio reduces the bytes written to / read from disk, often increasing overall throughput.
`recordsize=512K` and `compression=zstd-16`
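A minimal sketch of applying these properties; the dataset name `rpool/data` is a placeholder for whichever dataset (or the pool root, so children inherit) you want to tune:

```sh
# placeholder dataset name - adjust to your layout (see: zfs list)
zfs set recordsize=512K rpool/data
zfs set compression=zstd-16 rpool/data
```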
- Make sure the 512 KiB blocks can be issued to the SSD "in one go" (avoid premature splitting/aggregation limits).
- Check the block layer's advertised/effective request limits:
`/sys/block/<dev>/queue/max_hw_sectors_kb` (hardware limit)
- Set ZFS aggregation for SSDs to the hardware limit (on NVMe) and to 512 KiB (on SATA).
- Example for SATA:
`zfs_vdev_aggregation_limit_non_rotating=524288`
- Check that the Linux kernel actually detects your SSD as "non-rotating" (otherwise the setting above will have no effect):
`cat /sys/block/sda/queue/rotational` should read 0 for SSDs - otherwise override this.
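A quick sketch of both checks; the device names (`nvme0n1`, `sda`) are placeholders for your actual disks, and the override must be run as root and does not persist across reboots:

```sh
# hardware request limit, in KiB and converted to bytes for the ZFS parameter
cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
echo $(( $(cat /sys/block/nvme0n1/queue/max_hw_sectors_kb) * 1024 ))

# should print 0 for an SSD; override it if the device is mis-detected
cat /sys/block/sda/queue/rotational
echo 0 > /sys/block/sda/queue/rotational
```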
- Set a minimum ARC size so ZFS doesn't shed its cache too aggressively under memory pressure; the system will start swapping instead.
- Rule of thumb: `zfs_arc_min ~= 20% of MemTotal`
  - Example for a 16 GiB machine: 20% ~= 3.2 GiB
    `zfs_arc_min=3435973836`
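A small sketch for deriving that value from /proc/meminfo (assuming the usual MemTotal line in KiB):

```sh
# 20% of MemTotal, in bytes - on a 16 GiB machine this prints 3435973836
mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
echo $(( mem_kib * 1024 / 5 ))
```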
- Additional recommendations:
  - Lower `redundant_metadata` to at least `some` to reduce write overhead for metadata (use backups instead ;) )
  - Make sure `atime` is `off`
  - Set `checksum` to `blake3` (slower, but increases the safety margin for detecting broken blocks via checksum)
  - Set `dnodesize` to `auto`
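A sketch of applying the recommendations above; `rpool/data` is again a placeholder dataset (or use the pool root so children inherit):

```sh
# placeholder dataset name - adjust to your layout
zfs set redundant_metadata=some rpool/data
zfs set atime=off rpool/data
zfs set checksum=blake3 rpool/data
zfs set dnodesize=auto rpool/data
```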
  - `/var/log`: journal data is already compressed by systemd-journald, and the rest is typically low-value to recompress.
    - On the default CachyOS ZFS layout, `/var/log` is already its own dataset; set `compression=off` there.
  - `/var/cache`: package caches (e.g. pacman packages) are already compressed.
    - On the default CachyOS ZFS layout, `/var/cache` is already its own dataset; set `compression=off` there.
  - `/tmp`: consider creating a dedicated dataset mounted at `/tmp` with compression disabled (usually not worth the CPU cycles)
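A sketch of those three steps; the dataset names below are placeholders, so check your actual layout with `zfs list` first:

```sh
# placeholder dataset names - adjust to your layout
zfs set compression=off rpool/var/log
zfs set compression=off rpool/var/cache
# dedicated /tmp dataset with compression disabled
zfs create -o mountpoint=/tmp -o compression=off rpool/tmp
```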
Deduplication has improved massively with fast dedup (see https://klarasystems.com/articles/introducing-openzfs-fast-dedup/). If you have datasets where you might store duplicate blocks, consider activating it for those datasets - but only if you're tight on storage space, as it will reduce performance and use more memory to hold the deduplication tables.
- Set `dedup` to `blake3`
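For example (the dataset name is a placeholder; enable this only on datasets actually likely to hold duplicate blocks):

```sh
# placeholder dataset name
zfs set dedup=blake3 rpool/shared
```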
Changing dataset properties like compression, dedup, checksum, copies, etc. only affects new writes. To rewrite existing file data with the new settings, use zfs rewrite on the dataset's mountpoint (or a specific directory):
# rewrite everything below this directory (recommended: use -x to not cross into other datasets)
zfs rewrite -r -x /path/to/dataset/mountpoint
# optional: physical rewrite preserves logical birth times (requires feature@physical_rewrite)
zfs rewrite -P -r -x /path/to/dataset/mountpoint
Note: zfs rewrite does not apply changes which require a different logical block size (e.g. recordsize).
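If you want to use `-P`, you can first check whether your pool has the feature mentioned above enabled (the pool name is a placeholder):

```sh
# prints enabled/active if the pool supports physical rewrite
zpool get feature@physical_rewrite rpool
```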
- Disabling cache flushes can improve performance, but is ONLY recommended if:
- Your storage has real power-loss protection (PLP) / battery-backed write cache, OR
- You can accept excessive rollback of writes on a crash (this is not meant for mission-critical use / database servers).
`zfs_nocacheflush=1` and `zil_nocacheflush=1`
`sync=disabled` does a similar thing, as it allows writes that the application asks "to be written to storage" to be buffered in memory only. So if the system is battery-buffered and not a database server, this is probably acceptable (use your own judgement!)
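A sketch, with the dataset name as a placeholder; as noted above, use your own judgement, since in-flight writes are lost on a crash or power loss:

```sh
# placeholder dataset name
zfs set sync=disabled rpool/data
```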
echo 1073741824 > /sys/module/zfs/parameters/zfs_dirty_data_max
echo 90 > /sys/module/zfs/parameters/zfs_delay_min_dirty_percent
echo 3 > /sys/module/zfs/parameters/zio_taskq_write_tpq # 6 real core system
echo 524288 > /sys/module/zfs/parameters/zfs_vdev_aggregation_limit_non_rotating # on SATA (otherwise use your NVMe hardware limit)
echo 3435973836 > /sys/module/zfs/parameters/zfs_arc_min # 16 GB system
echo 1 > /sys/module/zfs/parameters/zfs_nocacheflush # use your own judgement here!
echo 1 > /sys/module/zfs/parameters/zil_nocacheflush # use your own judgement here!

# create /etc/modprobe.d/zfs-tuning.conf and add for example
options zfs zfs_dirty_data_max=1073741824
options zfs zfs_delay_min_dirty_percent=90
# 6 real core system
options zfs zio_taskq_write_tpq=3
# on SATA (otherwise use your NVMe hardware limit)
options zfs zfs_vdev_aggregation_limit_non_rotating=524288
# 16 GiB system
options zfs zfs_arc_min=3435973836
# use your own judgement here!
options zfs zfs_nocacheflush=1
options zfs zil_nocacheflush=1
# rebuild the initramfs if you use one, so the options apply early:
mkinitcpio -P
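After a reboot you can verify that the persistent options actually took effect; a minimal check covering a subset of the parameters set above:

```sh
grep -H . /sys/module/zfs/parameters/zfs_dirty_data_max \
          /sys/module/zfs/parameters/zfs_delay_min_dirty_percent \
          /sys/module/zfs/parameters/zfs_arc_min
```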