Message ID: cover.1718665689.git.boris@bur.io
Series: btrfs: dynamic and periodic block_group reclaim
On Mon, Jun 17, 2024 at 04:11:12PM -0700, Boris Burkov wrote:
> Btrfs's block_group allocator suffers from a well-known problem: it is
> capable of eagerly allocating too much space to either data or metadata
> (most often data, absent bugs) and then later being unable to allocate
> more space for the other when needed. When data starves metadata, this
> can, rather painfully, result in read-only filesystems that need
> careful manual balancing to fix.
>
> This can be worked around by:
> - enabling automatic reclaim
> - periodically running balance
>
> The latter is widely deployed via btrfsmaintenance
> (https://github.com/kdave/btrfsmaintenance) and the former is used at
> scale at Meta with good results. However, neither of those solutions is
> perfect, as they both currently use a fixed threshold. A fixed
> threshold is vulnerable to workloads that trigger high amounts of
> reclaim. This has led to btrfsmaintenance setting very conservative
> thresholds of 5 and 10 percent of data block groups.
> (https://github.com/kdave/btrfsmaintenance/commit/edbbfffe592f47c2849a8825f523e2ccc38b15f5)
> At Meta, we deal with an elevated level of reclaim which it would be
> desirable to reduce.
>
> This patch set expands on automatic reclaim, adding the ability to set
> a dynamic reclaim threshold that appropriately scales with the global
> filesystem allocation conditions, as well as periodic reclaim, which
> runs that reclaim sweep in the cleaner thread. Together, I believe they
> constitute a robust and general automatic reclaim system that should
> avoid unfortunate read-only filesystems in all but extreme conditions,
> where space is running quite low anyway and failure is more reasonable.
>
> At a very high level, the dynamic threshold's strategy is to set a
> fixed target of unallocated block groups (10 block groups) and linearly
> scale its aggression the further we are from that target. That way we
> do no automatic reclaim until we actually press against the unallocated
> target, allowing the allocator to gradually fill fragmented space with
> new extents, but we do claw back space after workloads that use and
> free a bunch of space, perhaps with fragmentation.
>
> I ran it on three workloads (described in detail in the dynamic reclaim
> patch):
> 1. bounce allocations around X% full.
> 2. fill up all the way and introduce full fragmentation.
> 3. write in a fragmented way until the filesystem is just about full.
> The script can be found here:
> https://github.com/boryas/scripts/tree/main/fio/reclaim
>
> The important results can be seen here (full results explorable at
> https://bur.io/dyn-rec/):
>
> bounce at 30%, higher relocations with a fixed threshold:
> https://bur.io/dyn-rec/bounce/reclaims.png
> https://bur.io/dyn-rec/bounce/reclaim_bytes.png
> https://bur.io/dyn-rec/bounce/unalloc_bytes.png
>
> hard 30% fragmentation, dynamic actually reclaims, relocations not
> excessive:
> https://bur.io/dyn-rec/strict_frag/reclaims.png
> https://bur.io/dyn-rec/strict_frag/reclaim_bytes.png
> https://bur.io/dyn-rec/strict_frag/unalloc_bytes.png
>
> fill it all the way up in a fragmented way, then keep making
> allocations:
> https://bur.io/dyn-rec/last_gig/reclaims.png
> https://bur.io/dyn-rec/last_gig/reclaim_bytes.png
> https://bur.io/dyn-rec/last_gig/unalloc_bytes.png

These results are great. Once you fix up the one comment I had, you can
add

Reviewed-by: Josef Bacik <josef@toxicpanda.com>

to the whole series.

Thanks,
Josef