Linux Filesystem Guide (2025)

Best filesystems for various parameters

Best for flashdrives with throwaway data

  • For UEFI, DOS, Windows Vista or earlier, old macOS: FAT32

  • For Windows 7 or later, macOS Big Sur or later: exFAT

  • For purely Linux: F2FS (compression=lz4,noatime,lazytime,discard)
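
For example, here's roughly what the pure-Linux option can look like (a sketch; the device /dev/sdX1 and mountpoint are placeholders, and F2FS compression must be enabled at format time):

  mkfs.f2fs -O extra_attr,compression /dev/sdX1
  mount -o compress_algorithm=lz4,noatime,lazytime,discard /dev/sdX1 /mnt/usb
  chattr -R +c /mnt/usb    # F2FS only compresses files/dirs flagged with +c (or matching compress_extension)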

Why only throwaway data?

These filesystems do not checksum data, and they are not Copy-on-Write (meaning they overwrite files in-place).

What if the data is sensitive?

Use LUKS with AES-128-GCM under F2FS.
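
A sketch of the LUKS layer (cryptsetup's authenticated AES-GCM mode is still marked experimental, and exact flags may vary by cryptsetup version; /dev/sdX1 is a placeholder):

  cryptsetup luksFormat --type luks2 --cipher aes-gcm-random --integrity aead --key-size 128 /dev/sdX1
  cryptsetup open /dev/sdX1 cryptflash
  # then format /dev/mapper/cryptflash with F2FS as shown above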

Best for PCs with only HDDs, OR flashdrives with important data

btrfs (compression=zstd-1,noatime,lazytime) on LUKS (aes-128-gcm); queue depth at 16 for HDDs, higher for flash
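
A minimal sketch of that layout, assuming the LUKS container is already open as /dev/mapper/cryptdata and the underlying HDD is /dev/sda (btrfs spells the compression option compress=zstd:1):

  mkfs.btrfs /dev/mapper/cryptdata
  mount -o compress=zstd:1,noatime,lazytime /dev/mapper/cryptdata /mnt/data
  echo 16 > /sys/block/sda/device/queue_depth    # HDD queue depth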

  • noatime is mandatory to avoid metadata duplication after snapshots.

  • btrfs supports defragmentation, which is important on HDDs.

  • Never do RAID 5/6 on btrfs. RAID0 is fine. RAID1 is counterintuitive when degraded. All RAID is annoying with LUKS. If you need RAID1 and aren't pre-prepared for btrfs's specific brand of weirdness, consider using mdadm under LUKS, though note that this loses checksummed self-healing from bitrot.

  • Why not ZFS?: ZFS should not be used on an HDD without a PLP-protected/DRAM-less/write-cache-disabled SLOG. Inline ZIL will gradually cause massive, irreparable fragmentation which will make performance abysmal; and a lack of a separate sync domain will cause immediate flushing of transaction groups, making this problem even worse. ZFS has no way to defragment. Engineering didn't even want running without a SLOG to be possible (It was a business ask.), and that was before SSDs even existed. Don't use ZFS on an HDD unless you have a spare device (even if it's an HDD) that you can dedicate to SLOGging.

Best filesystem for PCs with only SSDs

No namespace support

Ensure that you use SSDs that either have no DRAM or that have PLP.

ZFS (compression=zstd-fast-1,encryption=aes-128-gcm,recordsize=64K,logbias=throughput,zfs_immediate_write_sz=64K,checksum=fletcher4,atime=off) + ZFSBootMenu; queue depth at 32 for SATA or higher if SAS/NVMe

  • This skips the ZIL in a way that does not risk the durability of sync writes.

    • The ZIL exists primarily to avoid fragmentation on HDDs and to provide a separate domain for sync writes; these benefits are less important on SSDs (especially NVMe), while the ZIL still comes at the cost of doubling your writes.
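
To apply those settings, something like the following should work (a sketch; the pool name, device, and ashift are placeholders, and zfs_immediate_write_sz is a module parameter rather than a dataset property):

  zpool create -o ashift=12 \
      -O compression=zstd-fast-1 -O checksum=fletcher4 -O atime=off \
      -O recordsize=64K -O logbias=throughput \
      -O encryption=aes-128-gcm -O keyformat=passphrase \
      tank /dev/nvme0n1
  echo 65536 > /sys/module/zfs/parameters/zfs_immediate_write_sz
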
Namespace support

Only available in some enterprise-grade NVMe SSDs.

  • Namespace 1: 12GiB SLOG

  • Namespace 2: ZFS (compression=zstd-fast-1,encryption=aes-128-gcm,recordsize=64K,logbias=latency,zfs_immediate_write_sz=16K,checksum=fletcher4,atime=off) + ZFSBootMenu; high queue depth
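
Assuming the two namespaces already show up as /dev/nvme0n1 (the 12GiB SLOG) and /dev/nvme0n2, the pool creation might look like this (a sketch; names and ashift are placeholders):

  zpool create -o ashift=12 \
      -O compression=zstd-fast-1 -O checksum=fletcher4 -O atime=off \
      -O recordsize=64K -O logbias=latency \
      -O encryption=aes-128-gcm -O keyformat=passphrase \
      tank /dev/nvme0n2 \
      log /dev/nvme0n1
  echo 16384 > /sys/module/zfs/parameters/zfs_immediate_write_sz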

Minimal performant ZFS setup

  • 1 SATA HDD (CMR; ZFS overwhelms SMR) for VDEV (compression=zstd-fast-1,recordsize=256K,logbias=latency,zfs_immediate_write_sz=64K,checksum=fletcher4,atime=off); configure Linux's queue depth to 16 and enable Linux's I/O scheduler.

  • 1 SATA SSD (no DRAM) with GPT partitions for SLOG and SVDEV (64K and smaller). PLP isn't needed if there's no DRAM, but having DRAM improves perf; and an NVMe using namespaces instead of partitions performs far better, since namespaces have their own sync domains. Configure Linux's queue depth high (32 for SATA SSDs, 128 for SAS SSDs, whatever you want for NVMe SSDs) and disable Linux's I/O scheduler.

  • 1 SATA HDD (CMR) for backups (zfs send -w | zfs recv -F); configure Linux's queue depth to 16 and enable Linux's I/O scheduler.

Double the drives for self-healing from corruption and uninterrupted operation after drive failure.
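
A sketch of the whole thing, assuming sda is the data HDD, sdb1/sdb2 are the SSD's SLOG and SVDEV partitions, and sdc is the backup HDD (names, snapshot names, and ashift are placeholders):

  zpool create -o ashift=12 \
      -O compression=zstd-fast-1 -O checksum=fletcher4 -O atime=off \
      -O recordsize=256K -O logbias=latency \
      tank /dev/sda \
      log /dev/sdb1 \
      special /dev/sdb2
  zfs set special_small_blocks=64K tank                      # blocks <=64K land on the SSD svdev
  echo 65536 > /sys/module/zfs/parameters/zfs_immediate_write_sz
  echo 16 > /sys/block/sda/device/queue_depth                # HDD
  echo mq-deadline > /sys/block/sda/queue/scheduler          # HDD: keep Linux's scheduler
  echo none > /sys/block/sdb/queue/scheduler                 # SSD: no Linux scheduler
  zpool create backup /dev/sdc                               # backup pool
  zfs snapshot tank@2025-01-01
  zfs send -w tank@2025-01-01 | zfs recv -F backup/tank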

Regarding various options

Trim/defrag

  • Trim is for SSDs; defragmentation is for HDDs.

    • Technically SSDs can also suffer degraded performance from fragmentation, but it's minor and it's not considered worthwhile to defragment them, particularly given that controllers apparently often misrepresent what is actually fragmented. If you insist on defragmenting your SSDs, once per year is probably the smallest interval you should consider.
  • For removable media, do it continuously.

  • For long-term media, do it periodically.

  • Why: Continuous trim/defrag harms perf and longevity, but removable media aren't guaranteed to be present during periodic trims/defrags, so they should be trimmed/defragged continuously as a compromise.
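
In practice (a sketch; unit and command availability vary by distro, and /mnt/data is a placeholder):

  systemctl enable --now fstrim.timer                 # periodic trim for fixed internal SSDs
  # For removable flash, use the 'discard' mount option instead (continuous trim), as in the F2FS example above.
  btrfs filesystem defragment -r -czstd /mnt/data     # periodic defrag on a btrfs HDD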

Queue depth

  • HDDs are generally best at 16.

  • SSDs are generally best with high values.

  • ZFS's internal scheduler can only handle one max queue depth per drive across the entire OS. So on mixed-media pools, you either have to gimp your high-queue media by capping ZFS's queue low OR you have to let Linux's I/O scheduling handle the overqueuing. The latter is generally the better choice. In any case: disable Linux's I/O scheduling when your device's queue depth equals ZFS's.
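
A sketch of the knobs involved (sda is an HDD and nvme0n1 an SSD, both placeholders; the ZFS parameter shown is the module-wide per-drive cap):

  echo 16 > /sys/block/sda/device/queue_depth
  echo mq-deadline > /sys/block/sda/queue/scheduler       # let Linux schedule the HDD
  echo none > /sys/block/nvme0n1/queue/scheduler          # SSD's queue depth already matches ZFS's
  cat /sys/module/zfs/parameters/zfs_vdev_max_active      # ZFS's single OS-wide per-drive maximum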

Compression

In my own benchmarks, zstd-fast-1 is consistently faster and more-compressive than lz4 both in and outside of ZFS, for a comparable CPU hit. zstd-1 also beats lz4 in performance, but for slightly higher CPU; and it loses to zstd-fast-1. Your machine may benchmark differently.
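
If you want to benchmark this on your own data, the standalone CLI tools give a rough approximation (the in-kernel ZFS implementations may behave somewhat differently; "somefile" is a placeholder):

  time lz4 -1 -c somefile | wc -c            # lz4, compressed size in bytes
  time zstd -1 -c somefile | wc -c           # zstd-1
  time zstd --fast=1 -c somefile | wc -c     # zstd-fast-1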

Encryption

  • Native encryption is generally best, when it's available; it's seamless and avoids an extra tool/complexity, and often comes with integrations.

  • AES-128 is the perfect choice. There's afaict no good reason to believe quantum computing is a viable threat to it in your lifetime. Even assuming a Moore's-Law pace for quantum, cracking AES-128 would not become economical except for matters of extreme national concern; and quantum's actual rate of advancement is nowhere near Moore's Law. And classical computing is no threat to it, either. AES-256 is for the absolutely paranoid. It's appropriate for state secrets that need protection against decryption in the 22nd or 23rd century; not your personal files. Save your CPU cycles for useful work.

Deduplication

  • ZFS's online dedupe is crazy-intensive and only worth it in very specific contexts.

  • You can get 90% of the benefit for minimal system resources by periodically running rdfind; just make sure you tell it to use a 256-bit checksum. This performs file-based hardlink deduplication, and works on any filesystem that supports hardlinks (all Linux FSes).
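
A sketch (the path is a placeholder; do a dry run first):

  rdfind -checksum sha256 -makehardlinks true -dryrun true /data
  rdfind -checksum sha256 -makehardlinks true /data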

Duplication

  • btrfs (with the dup data profile) and ZFS (with copies=2) allow for healing from bitrot on a single drive, but this halves the usable space. It's a great option for backup drives: a single drive at twice your capacity is cheaper and simpler than two drives and a DAS enclosure.

  • ZFS defaults to duplicating all metadata, as there are severe data-loss risks if metadata is ever lost. If you are using a fault-tolerant RAID level, you may change redundant_metadata from all to most.
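
For reference (pool, dataset, and device names are placeholders):

  zfs create -o copies=2 backup/archive     # ZFS: per-dataset duplication
  mkfs.btrfs -d dup -m dup /dev/sdX1        # btrfs: dup profile for data and metadata
  zfs set redundant_metadata=most tank      # only on a fault-tolerant pool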

Checksums

  • Fletcher4 can be done 41 billion times per second vs BLAKE3's 7000 times per second on an AMD EPYC 4464P.

  • Fletcher4 will catch bitrot; it will not catch deliberate hash collisions, and it is unsuitable for deduplication.

    • You do not need to worry about either: you're not using ZFS deduplication, and deliberate hash collisions are protected against by native encryption, by not zfs send | zfs recving from untrusted datasets, and by not giving people root-level access to datasets.

atime

  • Set atime to noatime unless you have a very good reason to do otherwise.

    • The default relatime is still quite bad for performance and longevity, and benefits precious few applications, none of which you are ever likely to use. If you ever do use one of those apps, give it its own dataset (ZFS) or subvolume (btrfs) and have atime set only there.
    • Snapshotting filesystems (like btrfs) may, after a snapshot, duplicate all metadata when the atime is updated. This means that running find just once can dramatically increase your total used space. On btrfs, this is a major problem, and it is also a problem on LVM; on ZFS, metadata is only partly duplicated, so it's about two orders of magnitude less of a problem than with btrfs, but it is still problematic. Combining relatime or especially atime with frequent automatic snapshotting can result in storage usage positively ballooning.
  • lazytime defers writing atime, ctime, and mtime updates to disk by up to 12 hours. It makes relatime less of a performance hit; and even with noatime, it still helps with ctime and mtime changes. There is generally no reason not to use it wherever it is available, outside of RAMdisks (where all the metadata is already in RAM anyway) and snapshotting (since unwritten times won't be reflected in the snapshots).
    • Applications that really depend on time metadata should be calling fsync() or sync() all the time anyway, which makes their time metadata immediately durable in spite of lazytime.
    • When using hourly snapshots, lazytime twelfths the amount of metadata that gets duplicated (though it can result in untrue timestamps in those snapshots). This shouldn't be a problem for mtime/ctime changes that accompany actual file changes, since those shouldn't be affected by lazytime; and explicitly calling sync before zfs snapshot flushes the deferred timestamps to disk before the snapshot, ensuring consistency but no longer twelfthing the metadata duplication.
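
A sketch of the per-dataset approach on ZFS (pool and dataset names are placeholders; "mutt" just stands in for one of those rare atime-dependent apps):

  zfs set atime=off tank
  zfs create -o atime=on -o relatime=on tank/mutt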

recordsize

  • There is simply a fixed cost to all I/O from the physical mechanics of the drives.

    • HDDs take the same amount of time to read all recordsizes below 128K; and even 128K is close to this threshold, though the exact figure varies by drive, and some are over 128K.

      • Consequently, HDDs should never be set to a recordsize below 128K; it's literally pointless.
    • SSDs have a kink too, but it's lower — below 32K for SATA ones. It varies by drive.

  • In mixed media (HDD vdev + SSD svdev), I like to tune the HDD to 256K and the SSD to 64K; this allows each to focus on what they're good at. 128K is an unhappy compromise, just like inline ZIL.
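
For instance (pool and dataset names are placeholders):

  zfs set recordsize=256K tank                  # HDD vdev: big records
  zfs set special_small_blocks=64K tank         # route blocks <=64K to the SSD svdev
  zfs create -o recordsize=64K tank/smallish    # per-dataset override for small-record workloads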

Additional ZFS notes

Additional recommendations

  • On Linux, set xattr=sa for improved xattr/ACL performance.

  • Set normalization=formD if you are okay with the drawbacks of forcing everything to be UTF-8-only. It makes the precomposed and decomposed forms of ‹ó› (U+00F3 vs. o + U+0301) act like the same character, and it is more performant than formC. The other forms normalize too many characters, which harms performance and risks accidents.
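
Note that normalization (like casesensitivity and utf8only) can only be set when a dataset is created, whereas xattr can be changed at any time. A sketch (names are placeholders):

  zfs set xattr=sa tank
  zfs create -o xattr=sa -o normalization=formD tank/home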

Caveats

  • Unless you are running ZFS <2.0.0 or >=2.3.3, only do raw (-w) sends from encrypted datasets or you risk a data loss bug. For backups, raw sends are optimal anyway. At time of writing, no-one ships ZFS 2.3.3 yet because it has not been released!

  • Set init_on_alloc=0 on your kernel commandline; the default works fine in normal Linux but kills ARC performance with ZFS.

  • ZFS supports lazytime, but it lacks a way to set it automatically. If you want to use it, you need to manually mount -o remount,lazytime "$DATASET". You can automate this by wrapping zfs and mount with custom bash scripts in /usr/local/bin which secretly add lazytime to the mount options of everything those commands mount; you can then override this by passing nolazytime. You also need a script that remounts all mounted filesystems with lazytime, scheduled to run after your filesystems are mounted during startup (see the sketch after this list).

  • ZFS does not have a built-in way to schedule periodic trims; you have to do it manually; and, ideally, trim each SSD in a pool at different times.

  • Do not put swap on ZFS; it does not work well. Likewise, do not put swap outside of encrypted ZFS unless you encrypt that swapfile, else contents of memory can be read by an attacker. Using zram swap avoids these complications. Do not mix zram swap with disk swap or you will face priority inversion. If you use zswap, you can set it to a weak compression algorithm and your zram swap to a strong compression algorithm to get a kind of tiered swap, where the least-needed items are the most-heavily compressed.
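
Hedged sketches for three of the caveats above (lazytime remounting, periodic trims, and zram swap); pool, path, and schedule choices are placeholders:

  # Remount every mounted ZFS filesystem with lazytime (run after datasets are mounted at boot):
  awk '$3 == "zfs" {print $2}' /proc/self/mounts | while read -r mp; do
      mount -o remount,lazytime "$mp"
  done

  # Periodic trim, e.g. a weekly cron entry (stagger the day/time per pool or SSD):
  0 3 * * 0  /usr/sbin/zpool trim tank

  # zram swap via systemd's zram-generator (/etc/systemd/zram-generator.conf), with a strong compressor:
  [zram0]
  zram-size = ram / 2
  compression-algorithm = zstd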

