Linux Filesystem Guide (2025)
June 11, 2025 • 1,888 words
Best filesystems for various parameters
Best for flashdrives with throwaway data
For UEFI, DOS, Windows Vista or earlier, old macOS: FAT32
For Windows 7 or later, macOS Big Sur or later: exFAT
For purely Linux: F2FS (compression=lz4,noatime,lazytime,discard)
Why only throwaway data?
These filesystems do not checksum data, and they are not Copy-on-Write (meaning they overwrite files in-place).
What if the data is sensitive?
Use LUKS with AES-128-GCM under F2FS.
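For concreteness, a minimal sketch of that setup (the device name /dev/sdX1 and the mapping name are placeholders; the real F2FS mount option is spelled compress_algorithm=lz4 rather than compression=lz4, and cryptsetup's AEAD support for AES-GCM is still flagged experimental):

```
# Optional LUKS2 layer (AES-128-GCM, authenticated) for sensitive data.
cryptsetup luksFormat --type luks2 --cipher aes-gcm-random --key-size 128 \
    --integrity aead /dev/sdX1
cryptsetup open /dev/sdX1 stick

# F2FS with compression enabled at format time, mounted with the options above.
mkfs.f2fs -O extra_attr,inode_checksum,sb_checksum,compression /dev/mapper/stick
mount -o compress_algorithm=lz4,noatime,lazytime,discard /dev/mapper/stick /mnt
```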
Best for PCs with only HDDs, OR flashdrives with important data
btrfs (compression=zstd-1,noatime,lazytime) on LUKS (aes-128-gcm); queue depth at 16 for HDDs, higher for flash
noatime is mandatory to avoid snapshot data duplication. btrfs supports defragmentation, which is important on HDDs.
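A sketch of that stack (device and mapping names are placeholders; note btrfs spells the compression option compress=zstd:1, and the queue-depth sysfs path varies by controller):

```
# LUKS2 with AES-128-GCM, then btrfs on top.
cryptsetup luksFormat --type luks2 --cipher aes-gcm-random --key-size 128 \
    --integrity aead /dev/sdX
cryptsetup open /dev/sdX bulk
mkfs.btrfs /dev/mapper/bulk
mount -o compress=zstd:1,noatime,lazytime /dev/mapper/bulk /mnt

# Cap the HDD's queue depth at 16.
echo 16 > /sys/block/sdX/device/queue_depth
```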
Never do RAID 5/6 on btrfs. RAID0 is fine. RAID1 is counterintuitive when degraded. All RAID is annoying with LUKS. If you need RAID1 and aren't pre-prepared for btrfs's specific brand of weirdness, consider using mdadm under LUKS, though note that this loses checksummed self-healing from bitrot.
Why not ZFS?: ZFS should not be used on an HDD without a PLP-protected/DRAM-less/write-cache-disabled SLOG. Inline ZIL will gradually cause massive, irreparable fragmentation which will make performance abysmal; and a lack of a separate sync domain will cause immediate flushing of transaction groups, making this problem even worse. ZFS has no way to defragment. Engineering didn't even want running without a SLOG to be possible (it was a business ask), and that was before SSDs even existed. Don't use ZFS on an HDD unless you have a spare device (even if it's an HDD) that you can dedicate to SLOGging.
Best filesystem for PCs with only SSDs
No namespace support
Ensure that you use SSDs that either have no DRAM or that have PLP.
ZFS (compression=zstd-fast-1,encryption=aes-128-gcm,recordsize=64K,logbias=throughput,zfs_immediate_write_sz=64K,checksum=fletcher4,noatime) + ZFSBootMenu; queue depth at 32 for SATA or higher if SAS/NVMe
This skips the ZIL in a way that does not risk the durability of sync writes.
- The ZIL exists primarily to avoid fragmentation on HDDs and to provide a separate domain for sync writes; these benefits are less important on SSDs (especially NVMe), and the ZIL comes at the cost of doubling your writes.
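A sketch of creating such a pool (the pool name, passphrase keyformat, ashift=12, and /dev/nvme0n1 are placeholder assumptions; zfs_immediate_write_sz is a kernel module parameter rather than a dataset property, so it is set separately):

```
zpool create -o ashift=12 \
    -O compression=zstd-fast-1 -O checksum=fletcher4 -O atime=off \
    -O recordsize=64K -O logbias=throughput \
    -O encryption=aes-128-gcm -O keyformat=passphrase \
    tank /dev/nvme0n1

# Module parameter, not a dataset property; persist it via /etc/modprobe.d.
echo 65536 > /sys/module/zfs/parameters/zfs_immediate_write_sz
```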
Namespace support
Only available in some enterprise-grade NVMe SSDs.
Namespace 1: 12GiB SLOG
Namespace 2: ZFS (compression=zstd-fast-1,encryption=aes-128-gcm,recordsize=64K,logbias=latency,zfs_immediate_write_sz=16K,checksum=fletcher4,noatime) + ZFSBootMenu; high queue depth
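Assuming the two namespaces already exist and appear as /dev/nvme0n1 (the 12GiB SLOG) and /dev/nvme0n2 (data); namespace creation itself is done with nvme-cli and is controller-specific. The pool then differs from the previous sketch only in the log vdev and two tunables:

```
zpool create -o ashift=12 \
    -O compression=zstd-fast-1 -O checksum=fletcher4 -O atime=off \
    -O recordsize=64K -O logbias=latency \
    -O encryption=aes-128-gcm -O keyformat=passphrase \
    tank /dev/nvme0n2 \
    log /dev/nvme0n1

echo 16384 > /sys/module/zfs/parameters/zfs_immediate_write_sz
```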
Minimal performant ZFS setup
1 SATA HDD (CMR; ZFS overwhelms SMR) for VDEV (compression=zstd-fast-1,recordsize=256K,logbias=latency,zfs_immediate_write_sz=64K,checksum=fletcher4,noatime); configure Linux's queue depth to 16 and enable Linux's I/O scheduler.
1 SATA SSD (no DRAM) with GPT partitions for SLOG and SVDEV (64K and smaller) (don't need PLP if no DRAM, but having DRAM improves perf) (NVMe with namespaces instead of partitions is way better perf; namespaces have their own sync domains); configure Linux's queue depth high (32 for SATA SSDs, 128 for SAS SSDs, whatever you want for NVMe SSDs) and disable Linux's I/O scheduler.
1 SATA HDD (CMR) for backups (zfs send -w | zfs recv -F); configure Linux's queue depth to 16 and enable Linux's I/O scheduler.
Double the drives for self-healing from corruption and uninterrupted operation after drive failure.
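A sketch of the three-drive layout above (sda is the data HDD, sdb the SSD split into SLOG and svdev partitions, sdc the backup HDD; all names and sizes are placeholders). Since recordsize applies per dataset rather than per vdev, the 256K-HDD/64K-SSD split is expressed as recordsize=256K plus special_small_blocks=64K to route small blocks to the svdev:

```
zpool create -o ashift=12 \
    -O compression=zstd-fast-1 -O checksum=fletcher4 -O atime=off \
    -O recordsize=256K -O logbias=latency \
    tank /dev/sda \
    special /dev/sdb2 \
    log /dev/sdb1
zfs set special_small_blocks=64K tank   # blocks <=64K land on the SSD svdev

# HDDs: queue depth 16, scheduler on. SSD: deep queue, scheduler off.
for d in sda sdc; do
    echo 16          > /sys/block/$d/device/queue_depth
    echo mq-deadline > /sys/block/$d/queue/scheduler
done
echo 32   > /sys/block/sdb/device/queue_depth
echo none > /sys/block/sdb/queue/scheduler

echo 65536 > /sys/module/zfs/parameters/zfs_immediate_write_sz

# Backup pool on the third drive, fed with raw sends.
zpool create backup /dev/sdc
zfs snapshot -r tank@nightly
zfs send -w -R tank@nightly | zfs recv -F backup/tank
```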
Regarding various options
Trim/defrag
Trim is for SSDs; defragmentation is for HDDs.
- Technically SSDs can also suffer degraded performance from fragmentation, but it's minor and it's not considered worthwhile to defragment them, particularly given that controllers apparently often misrepresent what is actually fragmented. If you insist on defragmenting your SSDs, once per year is probably the smallest interval you should consider.
For removable media, do it continuously.
For long-term media, do it periodically.
Why: Continuous trim/defrag harms perf and longevity, but removable media aren't guaranteed to be present during periodic trims/defrags, so they should be trimmed/defragged continuously as a compromise.
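In practice that policy maps onto something like the following (pool names, mountpoints, and the cadence are placeholder choices):

```
# Removable media: continuous trim (the "discard" mount option, or for ZFS:)
zpool set autotrim=on usbpool

# Fixed media: periodic instead.
systemctl enable --now fstrim.timer    # non-ZFS filesystems, weekly by default
zpool trim tank                        # ZFS; run from cron/systemd on a schedule

# HDDs under btrfs: periodic defragmentation.
btrfs filesystem defragment -r /mnt/bulk
```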
Queue depth
HDDs are generally best at 16.
SSDs are generally best with high values.
ZFS's internal scheduler can only handle one max queue depth per drive across the entire OS. So on mixed-media pools, you either have to gimp your high-queue media by capping ZFS's queue low OR you have to let Linux's I/O scheduling handle the overqueuing. The latter is generally the better choice. In any case: disable Linux's I/O scheduling when your device's queue depth equals ZFS's.
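A sketch of that arrangement, assuming the ZFS-wide cap in question is the zfs_vdev_max_active module parameter (the values are illustrative, not recommendations):

```
# One ZFS-wide per-vdev cap, applied to every pool member:
echo 32 > /sys/module/zfs/parameters/zfs_vdev_max_active

# Let Linux's scheduler absorb the excess on the slow drives...
echo mq-deadline > /sys/block/sda/queue/scheduler
echo 16          > /sys/block/sda/device/queue_depth

# ...and get out of the way where the device's queue depth matches ZFS's.
echo none > /sys/block/nvme0n1/queue/scheduler
```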
Compression
In my own benchmarks, zstd-fast-1 is consistently faster and more-compressive than lz4 both in and outside of ZFS, for a comparable CPU hit. zstd-1 also beats lz4 in performance, but for slightly higher CPU; and it loses to zstd-fast-1. Your machine may benchmark differently.
Encryption
Native encryption is generally best, when it's available; it's seamless and avoids an extra tool/complexity, and often comes with integrations.
AES-128 is the perfect choice. There's afaict no good reason to believe quantum computing is a viable threat to it in your lifetime. Even assuming a Moore's-Law rate of progress for quantum, cracking AES-128 would not be economical except for matters of extreme national concern; and quantum's actual rate of advancement is nowhere near Moore's Law. Classical computing is no threat to it, either. AES-256 is for the absolutely paranoid: it's appropriate for state secrets that need protection against decryption in the 22nd or 23rd century, not your personal files. Save your CPU cycles for useful work.
Deduplication
ZFS's online dedupe is crazy-intensive and only worth it in very specific contexts.
You can get 90% of the benefit for minimal system resources by periodically running rdfind; just make sure you tell it to use a 256-bit checksum. This performs file-based hardlink deduplication, and works on any filesystem that supports hardlinks (all Linux FSes).
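A sketch of an rdfind run (the path is a placeholder; -checksum sha256 is the 256-bit option mentioned above):

```
# Preview what would be deduplicated, then replace duplicates with hardlinks.
rdfind -dryrun true -checksum sha256 /srv/data
rdfind -makehardlinks true -checksum sha256 /srv/data
```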
Duplication
btrfs (with data=dup) and ZFS (with copies=2) allow for healing from bitrot on a single drive, but this halves the usable space. It's a great option for backup drives: a single drive at twice your capacity is cheaper and simpler than two drives and a DAS enclosure.
ZFS defaults to duplicating all metadata, as there are severe data-loss risks if metadata is ever lost. If you are using a fault-tolerant RAID level, you may change this to most.
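The corresponding knobs, sketched (pool and device names are placeholders; on btrfs the DUP data profile is set at mkfs time or by a later balance, not by a mount option):

```
# ZFS: two copies of user data on a single-drive backup pool.
zfs set copies=2 backuppool
# With a fault-tolerant RAID level you may relax metadata duplication:
zfs set redundant_metadata=most tank

# btrfs: DUP for data (and metadata) on a single drive.
mkfs.btrfs -d dup -m dup /dev/sdX
# ...or convert an existing filesystem:
btrfs balance start -dconvert=dup /mnt/backup
```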
Checksums
Fletcher4 can be done 41 billion times per second vs BLAKE3's 7000 times per second on an AMD EPYC 4464P.
Fletcher4 will catch bitrot; it will not catch deliberate hash collisions, and it is unsuitable for deduplication.
- You do not need to worry about either: you're not using ZFS deduplication, and deliberate hash collisions are protected against by native encryption, by not zfs send | zfs recving from untrusted datasets, and by not giving people root-level access to datasets.
atime
Set atime to noatime unless you have a very good reason to do otherwise.
- The default relatime is still quite bad for performance and longevity, and benefits precious few applications, none of which you are ever likely to use. If you ever do use one of those apps, give it its own dataset (ZFS) or subvolume (btrfs) and have atime set only there.
- Snapshotting filesystems (like btrfs) may, after a snapshot, duplicate all metadata when the atime is updated. This means that running find just once can dramatically increase your total used space. On btrfs, this is a major problem, and it is also a problem on LVM; on ZFS, metadata is only partly duplicated, so it's about two orders of magnitude less of a problem than with btrfs, but it is still problematic. Combining relatime or especially atime with frequent automatic snapshotting can result in storage usage positively ballooning.
- By default, lazytime defers updates to atime, ctime, and mtime by up to 12 hours. It makes relatime less of a performance hit; but even with noatime, it still helps with ctime and mtime changes. There is generally no reason to not always use it where it is available, outside of RAMdisks (where all the metadata is already in RAM anyway) and snapshotting (since unwritten times won't be reflected in the snapshots). Applications that really depend on time metadata should typically call fsync() or sync() all the time, which will make their time metadata immediately durable in spite of lazytime. And when using hourly snapshots, it twelfths the amount of metadata that is duplicated (though it can result in untrue timestamps in such snapshots). This shouldn't be a problem for mtime/ctime changes that accompany actual file changes, since those shouldn't be affected by lazytime; and explicitly calling sync before zfs snapshot should flush the lazytime changes to disk before the snapshot, ensuring consistency but no longer twelfthing the metadata duplications.
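A sketch of the two habits this implies on ZFS (dataset names are placeholders):

```
# noatime everywhere, except the one dataset a rare app actually needs it on.
zfs set atime=off tank
zfs create -o atime=on tank/needs-atime

# When relying on lazytime, flush deferred timestamps before snapshotting.
sync && zfs snapshot tank/data@$(date +%F)
```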
recordsize
There is simply a fixed cost to all I/O from the physical mechanics of the drives.
HDDs take the same amount of time to read all recordsizes below 128K; and even 128K is close to this threshold, though the exact figure varies by drive, and some are over 128K.
- Consequently, HDDs should never be set to a recordsize below 128K; it's literally pointless.
SSDs have a kink too, but it's lower — below 32K for SATA ones. It varies by drive.
In mixed media (HDD vdev + SSD svdev), I like to tune the HDD to 256K and the SSD to 64K; this allows each to focus on what they're good at. 128K is an unhappy compromise, just like inline ZIL.
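Expressed as ZFS properties, that mixed-media tuning looks roughly like this (names are placeholders; recordsize is per dataset, and special_small_blocks decides what lands on the SSD svdev):

```
zfs set recordsize=256K tank            # HDD-friendly default for the pool
zfs set special_small_blocks=64K tank   # blocks <=64K go to the SSD svdev
```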
Additional ZFS notes
Additional recommendations
On Linux, set xattr=sa for improved xattr/ACL performance.
Set normalization=formD if you are okay with the drawbacks of forcing everything to be UTF-8-only. It makes ‹ó› and ‹ó› act like the same character, and is more-performant than formC. The other forms normalize too many characters, which harms performance and risks accidents.
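Sketched as commands (dataset names are placeholders; normalization can only be set when a dataset is created, whereas xattr=sa can be changed at any time):

```
zfs set xattr=sa tank
zfs create -o normalization=formD -o utf8only=on tank/home
```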
Caveats
Unless you are running ZFS <2.0.0 or >=2.3.3, only do raw (-w) sends from encrypted datasets or you risk a data-loss bug. For backups, raw sends are optimal anyway. At time of writing, no-one ships ZFS 2.3.3 yet because it has not been released!
Set init_on_alloc=0 on your kernel commandline; the default works fine in normal Linux but kills ARC performance with ZFS.
ZFS supports lazytime, but it lacks a way to set it automatically. If you want to use it, you need to manually mount -o remount,lazytime "$DATASET". You can automate this by wrapping zfs and mount with custom bash scripts in /usr/local/bin which secretly add lazytime to the mount options of everything those commands mount; you can then override this by passing nolazytime. You also need to write a script that remounts all mounted filesystems with lazytime, and schedule that to run after your filesystems are mounted during startup; a sketch of such a script follows this list.
ZFS does not have a built-in way to schedule periodic trims; you have to do it manually; and, ideally, trim each SSD in a pool at different times.
Do not put swap on ZFS; it does not work well. Likewise, do not put swap outside of encrypted ZFS unless you encrypt that swapfile, else contents of memory can be read by an attacker. Using zram swap avoids these complications. Do not mix zram swap with disk swap or you will face priority inversion. If you use zswap, you can set it to a weak compression algorithm and your zram swap to a strong compression algorithm to get a kind of tiered swap, where the least-needed items are the most-heavily compressed.
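As a sketch of the remount script mentioned in the lazytime caveat above (the script name is hypothetical, and the systemd/cron integration is left to you):

```
#!/usr/bin/env bash
# /usr/local/bin/zfs-lazytime-remount (hypothetical name): remount every
# mounted ZFS dataset with lazytime; run once after pools are imported.
set -euo pipefail
zfs list -H -o name,mounted,mountpoint | while read -r name mounted mountpoint; do
    [ "$mounted" = yes ] || continue
    [ "$mountpoint" != legacy ] && [ "$mountpoint" != none ] || continue
    mount -o remount,lazytime "$mountpoint"
done

# Periodic trim also has to be scheduled by hand, e.g. monthly from cron:
# 0 3 1 * *  root  zpool trim tank
```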