[RFC,0/6] btrfs: dynamic and periodic block_group reclaim

Message ID: cover.1706914865.git.boris@bur.io

Message

Boris Burkov Feb. 2, 2024, 11:12 p.m. UTC
Btrfs's block_group allocator suffers from a well-known problem: it is
capable of eagerly allocating too much space to either data or metadata
(most often data, absent bugs) and then later being unable to allocate
more space for the other when needed. When data starves metadata, this
is especially painful, as it can result in read-only filesystems that
need careful manual balancing to fix.

This can be worked around by:
- enabling automatic reclaim
- periodically running balance

Neither of these enjoys widespread use, as far as I know, though the
former is used at scale at Meta with good results.

This patch set expands on automatic reclaim, adding the ability to set a
dynamic reclaim threshold that scales appropriately with the global
filesystem allocation conditions, as well as periodic reclaim, which runs
that reclaim sweep in the cleaner thread. Together, I believe they
constitute a robust and general automatic reclaim system that should
avoid unfortunate read-only filesystems in all but extreme conditions,
where space is running quite low anyway and failure is more reasonable.
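
As a rough standalone model of the shape of that idea (the real
calculation lives in the dynamic reclaim patch; the function, names, and
ramp below are purely illustrative, not the patch code), the per
block_group reclaim threshold grows as global unallocated space shrinks,
so a mostly empty filesystem reclaims nothing while a nearly fully
allocated one reclaims aggressively:

/* Illustrative model only, not the patch code. */
#include <stdio.h>
#include <stdint.h>

/*
 * Scale the per-block-group reclaim threshold with how much of the
 * device is still unallocated: plenty of unallocated space means no
 * reclaim at all, almost none means reclaim anything not nearly full.
 */
static int dynamic_reclaim_threshold(uint64_t total_bytes,
                                     uint64_t unalloc_bytes)
{
        if (total_bytes == 0)
                return 0;
        /* Percent of the device already handed out to block groups. */
        uint64_t alloc_pct = 100 - (unalloc_bytes * 100 / total_bytes);

        /* No pressure until most of the device is allocated. */
        if (alloc_pct < 75)
                return 0;
        /* Ramp from 0 at 75% allocated up to 100 when fully allocated. */
        return (int)((alloc_pct - 75) * 4);
}

int main(void)
{
        uint64_t total = 100ULL << 30; /* 100 GiB device */
        uint64_t samples[] = { 50ULL << 30, 20ULL << 30, 10ULL << 30,
                               4ULL << 30, 1ULL << 30 };

        for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                printf("unallocated %2llu GiB -> threshold %d%%\n",
                       (unsigned long long)(samples[i] >> 30),
                       dynamic_reclaim_threshold(total, samples[i]));
        return 0;
}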

I ran it on three workloads (described in detail in the dynamic reclaim
patch); in brief, they are:
1. bounce allocations around X% full.
2. fill up all the way and introduce full fragmentation.
3. write in a fragmented way until the filesystem is just about full.
The script can be found here:
https://github.com/boryas/scripts/tree/main/fio/reclaim

The important results can be seen here (full results are explorable at
bur.io/dyn-rec/):

bounce at 30%, much higher relocations with a fixed threshold:
https://bur.io/dyn-rec/bounce-30/relocs.png

hard 30% fragmentation, dynamic actually reclaims, relocs not crazy:
https://bur.io/dyn-rec/strict_frag-30/unalloc_bytes.png
https://bur.io/dyn-rec/strict_frag-30/relocs.png

fill it all the way up, not crazy churn, but saving a buffer:
https://bur.io/dyn-rec/last_gig/unalloc_bytes.png
https://bur.io/dyn-rec/last_gig/relocs.png
https://bur.io/dyn-rec/last_gig/thresh.png

Boris Burkov (6):
  btrfs: report reclaim count in sysfs
  btrfs: store fs_info on space_info
  btrfs: dynamic block_group reclaim threshold
  btrfs: periodic block_group reclaim
  btrfs: urgent periodic reclaim pass
  btrfs: prevent pathological periodic reclaim loops

 fs/btrfs/block-group.c |  26 ++++---
 fs/btrfs/block-group.h |   1 +
 fs/btrfs/space-info.c  | 165 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/space-info.h  |  28 +++++++
 fs/btrfs/sysfs.c       |  79 +++++++++++++++++++-
 5 files changed, 289 insertions(+), 10 deletions(-)

Comments

David Sterba Feb. 6, 2024, 2:55 p.m. UTC | #1
On Fri, Feb 02, 2024 at 03:12:42PM -0800, Boris Burkov wrote:
> Btrfs's block_group allocator suffers from a well known problem, that
> it is capable of eagerly allocating too much space to either data or
> metadata (most often data, absent bugs) and then later be unable to
> allocate more space for the other, when needed. When data starves
> metadata, this can extra painfully result in read only filesystems that
> need careful manual balancing to fix.
> 
> This can be worked around by:
> - enabling automatic reclaim
> - periodically running balance
> 
> Neither of these enjoy widespread use, as far as I know, though the
> former is used at scale at Meta with good results.

https://github.com/kdave/btrfsmaintenance is to my knowledge widely used
and installed on distros.  (Also my most starred project on github.)

The idea is to keep the balance separate from the kernel, allowing users
and administrators to easily tweak the parameters and timing. We haven't
added automatic reclaim to the kernel as it tends to start at the worst
time. The jobs from btrfsmaintenance are scheduled according to calendar
events (systemd.timer).

Also, the jobs don't have to be run at all if the package is not installed.

The problem of balancing the amount of data and metadata chunks is known,
and there are only heuristics; we can't solve it without knowing the
exact usage pattern.

> This patch set expands on automatic reclaim, adding the ability to set a
> dynamic reclaim threshold that appropriately scales with the global file
> system allocation conditions as well as periodic reclaim which runs that
> reclaim sweep in the cleaner thread. Together, I believe they constitute
> a robust and general automatic reclaim system that should avoid
> unfortunate read only filesystems in all but extreme conditions, where
> space is running quite low anyway and failure is more reasonable.
> 
> I ran it on three workloads (described in detail on the dynamic reclaim
> patch) but they are:
> 1. bounce allocations around X% full.
> 2. fill up all the way and introduce full fragmentation.
> 3. write in a fragmented way until the filesystem is just about full.
> script can be found here:
> https://github.com/boryas/scripts/tree/main/fio/reclaim

A common workload on distros is a regular system update (rolling distro)
with snapshots (snapper) and cleanup. This can create a lot of under-used
block groups, both data and metadata. Reclaiming them periodically was
one of the founding ideas for the btrfsmaintenance project.

The reclaim is needed to make the space more compact, as the randomly
removed extents leave holes for new data, so this is a good example for
either scripted or automatic reclaim.

However, you can also find use cases where this would harm performance or
just waste IO, as the data is short-lived and shuffling around unused
block groups does not help much.

The exact parameters of auto reclaim also depend on the storage type: an
NVMe device would probably be fine with any amount of data, an HDD not so
much.

I can't tell from your description above what the estimated frequency of
the reclaim would be. I understand that the urgent reclaim would start as
needed, but otherwise reclaim of, say, 30% used block groups can wait a
few days, as there are usually more new data than deletions.

Also, with more block groups around it's more likely to find good
candidates for the size classes and then do the placement.

> The important results can be seen here (full results explorable at
> bur.io/dyn-rec/)
> 
> bounce at 30%, much higher relocations with a fixed threshold:
> https://bur.io/dyn-rec/bounce-30/relocs.png
> 
> hard 30% fragmentation, dynamic actually reclaims, relocs not crazy:
> https://bur.io/dyn-rec/strict_frag-30/unalloc_bytes.png
> https://bur.io/dyn-rec/strict_frag-30/relocs.png
> 
> fill it all the way up, not crazy churn, but saving a buffer:
> https://bur.io/dyn-rec/last_gig/unalloc_bytes.png
> https://bur.io/dyn-rec/last_gig/relocs.png
> https://bur.io/dyn-rec/last_gig/thresh.png
> 
> Boris Burkov (6):
>   btrfs: report reclaim count in sysfs
>   btrfs: store fs_info on space_info
>   btrfs: dynamic block_group reclaim threshold
>   btrfs: periodic block_group reclaim
>   btrfs: urgent periodic reclaim pass
>   btrfs: prevent pathological periodic reclaim loops

So one thing is to have the mechanism for the reclaim; I think that's
the easy part. The tuning will be interesting.
Boris Burkov Feb. 6, 2024, 10:07 p.m. UTC | #2
On Tue, Feb 06, 2024 at 03:55:24PM +0100, David Sterba wrote:
> On Fri, Feb 02, 2024 at 03:12:42PM -0800, Boris Burkov wrote:
> > Btrfs's block_group allocator suffers from a well known problem, that
> > it is capable of eagerly allocating too much space to either data or
> > metadata (most often data, absent bugs) and then later be unable to
> > allocate more space for the other, when needed. When data starves
> > metadata, this can extra painfully result in read only filesystems that
> > need careful manual balancing to fix.
> > 
> > This can be worked around by:
> > - enabling automatic reclaim
> > - periodically running balance
> > 
> > Neither of these enjoy widespread use, as far as I know, though the
> > former is used at scale at Meta with good results.
> 
> https://github.com/kdave/btrfsmaintenance is to my knowledge widely used
> and installed on distros.  (Also my most starred project on github.)

Oh, cool, I'm glad that is out there and being used. I'm sorry for my ignorance.

> 
> The idea is to make the balance separate from kernel, allowing users and
> administrators to easily tweak the parameters and timing. We haven't
> added automatic reclaim to kernel as it tends to start at the worst
> time. The jobs from btrfsmaintenance are scheduled according to the
> calendar events (systemd.timer).

Makes sense.

> 
> Also the jobs don't have to be ran at all, the package not installed.
> 
> The problem with balancing amount of data and metadata chunks is known
> and there are only heuristics, we can't solve that without knowing the
> exact usage pattern.

Agreed.

> 
> > This patch set expands on automatic reclaim, adding the ability to set a
> > dynamic reclaim threshold that appropriately scales with the global file
> > system allocation conditions as well as periodic reclaim which runs that
> > reclaim sweep in the cleaner thread. Together, I believe they constitute
> > a robust and general automatic reclaim system that should avoid
> > unfortunate read only filesystems in all but extreme conditions, where
> > space is running quite low anyway and failure is more reasonable.
> > 
> > I ran it on three workloads (described in detail on the dynamic reclaim
> > patch) but they are:
> > 1. bounce allocations around X% full.
> > 2. fill up all the way and introduce full fragmentation.
> > 3. write in a fragmented way until the filesystem is just about full.
> > script can be found here:
> > https://github.com/boryas/scripts/tree/main/fio/reclaim
> 
> A common workload on distros is regular system update (rolling distro)
> with snapshots (snapper) and cleanup. This can create a lot of under
> used block groups, both data and metadata. Reclaiming that preriodically
> was one of the ground ideas for the btrfsmaintenance project.

I believe this is pretty similar to my workload 2 in spirit, except I
haven't done much with snapshots. I would love to run this workload so
I'll try to set it up with a VM. If you have a script for it already, or
even tips for setting it up, I would be quite grateful :)

I think that the "lots of random deletes leave empty block groups"
workload is the most interesting one in general for reclaim, and I
think it's cool that it happens in the real world :)

> 
> The reclaim is needed to make the space more compact as the randomly
> removed unused extents create holes for new data so this is a good
> example for either scripted or automatic reclaim.
> 
> However you can also find use case where this would harm performance or
> just waste IO as the data are short lived and shuffling around unused
> block groups does not help much.

+1, definitely trying to avoid this.

> 
> The exact parameters of auto reclaim also depend on the storage type, an
> NVMe would be probably fine with any amount of data, HDD not so much.

Good point, have only tested on NVMe. Definitely needs to be tunable to
not abuse HDDs.

> 
> I don't know from your description above what's the estimated frequency
> of the reclaim? I understand that the urgent reclaim would start as
> needed, but otherwise the frequency of reclaim of say 30% used block
> groups can stay fine for a few days, as there are usually more new data
> than deletions.
> 
> Also with more block groups around it's more likely to find good
> candidates for the size classes and then do the placement.

I think talking about my workload 2 here is helpful. Roughly, it writes
out 100G on a ~110G disk, then deletes 70G in perfectly fragmenting
stripes, so if we were way too aggressive, or used the current
autoreclaim with an unlucky threshold, we would reclaim all 100
block_groups. The dynamic reclaim threshold spikes up to the max and
relocates 7 block groups, which is enough for the negative feedback loop
to bring it back down to a low threshold and stop further reclaim.

See https://bur.io/dyn-rec/strict_frag-30/thresh.png for the threshold
curve and https://bur.io/dyn-rec/strict_frag-30/relocs.png for the
reclaim counts. (FWIW, I didn't hack it up to be perfectly evil, so the
30% threshold config doesn't relocate all 100 block groups in that
graph.)

I will try to plot threshold curves more systematically to get a better
sense of how to cause the most reclaims possible, for a worst-case
estimate.

In case you were asking more about the period it runs at: as written
right now, it runs with every cleaner thread run, but skips block_groups
that got an allocation since the last cleaner thread run. I think you
make an excellent point that a rate more like "daily" or "weekly" is much
better than "minutely". That gives more time to reach a quiescent state,
fill in gaps with small writes, etc. At a minimum, I think the periodic
reclaim ought to have a configurable period with a relatively long
default (this should help with HDDs too?)
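
To make that skip rule concrete, here's a tiny standalone model (the
struct, field, and function names below are made up for illustration,
they're not from the patches): a block group is only a candidate for the
periodic sweep if nothing was allocated from it since the previous pass
and its usage is below the current threshold.

/* Illustrative model of the periodic sweep's skip rule, not the patch code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct bg_model {
        uint64_t length;             /* block group size in bytes */
        uint64_t used;               /* bytes currently used */
        bool alloc_since_last_sweep; /* any allocation since the previous pass? */
};

static bool periodic_reclaim_candidate(struct bg_model *bg, int threshold_pct)
{
        /* An allocation since the last pass means the group is still hot. */
        if (bg->alloc_since_last_sweep) {
                bg->alloc_since_last_sweep = false; /* re-arm for the next pass */
                return false;
        }
        /* Otherwise reclaim it if its usage is under the threshold. */
        return bg->used * 100 < bg->length * (uint64_t)threshold_pct;
}

int main(void)
{
        struct bg_model hot = { 1ULL << 30, 100ULL << 20, true };
        struct bg_model cold = { 1ULL << 30, 100ULL << 20, false };

        /* hot is skipped (0), cold is a candidate at a 30% threshold (1). */
        printf("hot: %d, cold: %d\n",
               periodic_reclaim_candidate(&hot, 30),
               periodic_reclaim_candidate(&cold, 30));
        return 0;
}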

> 
> > The important results can be seen here (full results explorable at
> > bur.io/dyn-rec/)
> > 
> > bounce at 30%, much higher relocations with a fixed threshold:
> > https://bur.io/dyn-rec/bounce-30/relocs.png
> > 
> > hard 30% fragmentation, dynamic actually reclaims, relocs not crazy:
> > https://bur.io/dyn-rec/strict_frag-30/unalloc_bytes.png
> > https://bur.io/dyn-rec/strict_frag-30/relocs.png
> > 
> > fill it all the way up, not crazy churn, but saving a buffer:
> > https://bur.io/dyn-rec/last_gig/unalloc_bytes.png
> > https://bur.io/dyn-rec/last_gig/relocs.png
> > https://bur.io/dyn-rec/last_gig/thresh.png
> > 
> > Boris Burkov (6):
> >   btrfs: report reclaim count in sysfs
> >   btrfs: store fs_info on space_info
> >   btrfs: dynamic block_group reclaim threshold
> >   btrfs: periodic block_group reclaim
> >   btrfs: urgent periodic reclaim pass
> >   btrfs: prevent pathological periodic reclaim loops
> 
> So one thing is to have the mechanism for the reclaim, I think that's
> the easy part, the tuning will be interesting.

My 2c based on what I learned from this effort, and from your feedback:

Our two goals should be:
1. Avoid unnecessary reclaim; it wastes user resources and can hurt
   their system's performance.
2. Prevent unallocated space from dropping to its last 1MiB before it's
   too late.

I think anything with a fixed threshold is unlikely to fully achieve
either goal, as unlucky workloads will either operate below the
threshold and reclaim too much or above it and never reclaim.

I believe the dynamic threshold with a negative feedback loop is the
right sort of idea for achieving both goals. Ultimately, it is a
continuous function that encodes "reclaim at all costs when it's really
bad, don't reclaim much otherwise". I think it could also work to drop
the extra distraction of modelling it with a continuous function and
instead encode the two goals more discretely/directly.

i.e., very long period, low-threshold periodic maintenance (basically
exactly btrfsmaintenance; it doesn't need to be in the kernel) plus the
kernel having "urgent" conditions where it reclaims more aggressively in
a limited way, just to get us back to a few gigs of unallocated space.
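
As a sketch of what such a discrete "urgent" rule could look like (the
2 GiB cutoff and the names here are hypothetical, not something from this
series), the kernel would only step in once unallocated space falls below
a small fixed buffer:

/* Hypothetical "urgent" rule, not from this series. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Only reclaim aggressively once unallocated space is nearly gone. */
static bool urgent_reclaim_needed(uint64_t unalloc_bytes)
{
        /* Hypothetical cutoff: less than 2 GiB left unallocated. */
        return unalloc_bytes < (2ULL << 30);
}

int main(void)
{
        /* 1 GiB left -> urgent (1), 10 GiB left -> not urgent (0). */
        printf("%d %d\n", urgent_reclaim_needed(1ULL << 30),
               urgent_reclaim_needed(10ULL << 30));
        return 0;
}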

I also saw that btrfsmaintenance defaults to dusage=5 then dusage=10,
which is lower than (but similar to!) the quiescent-state thresholds I
have seen in my tests (around 15-20). I may try to tune it to land around
10% for most healthy fses, as that seems to be the safest number we know.

By the way, I think the dynamic threshold could be implemented fully in
userspace by using the limit flag of balance and recalculating the threshold
between each reclaim. Would you be more interested in experimenting with
that in btrfsmaintenance? I do think that in the long run, some kind of
"urgent unalloc protection" does belong in the kernel by default, assuming
we can really nail it down perfectly.

Thanks for your feedback,
Boris
David Sterba Feb. 19, 2024, 7:38 p.m. UTC | #3
On Tue, Feb 06, 2024 at 02:07:52PM -0800, Boris Burkov wrote:
> On Tue, Feb 06, 2024 at 03:55:24PM +0100, David Sterba wrote:
> > On Fri, Feb 02, 2024 at 03:12:42PM -0800, Boris Burkov wrote:
> > A common workload on distros is regular system update (rolling distro)
> > with snapshots (snapper) and cleanup. This can create a lot of under
> > used block groups, both data and metadata. Reclaiming that preriodically
> > was one of the ground ideas for the btrfsmaintenance project.
> 
> I believe this is pretty similar to my workload 2 in spirit, except I
> haven't done much with snapshots. I would love to run this workload so
> I'll try to set it up with a VM. If you have a script for it already, or
> even tips for setting it up, I would be quite grateful :)
> 
> I think that the "lots of random deletes leave empty block groups"
> workload is the most interesting one in general for reclaim, and I
> think it's cool that it happens in the real world :)

As a simulation of that I'm using a git-based workload that randomly
checks out commits and does a snapshot. A once-working script is here:
https://github.com/kdave/testunion/blob/master/test-snapgit/startme
(I may have some updates but I'd have to find the most recent version).

The git repo used should provide large files too, so it's closer to what
e.g. rpm does.

> > The exact parameters of auto reclaim also depend on the storage type, an
> > NVMe would be probably fine with any amount of data, HDD not so much.
> 
> Good point, have only tested on NVMe. Definitely needs to be tunable to
> not abuse HDDs.

I think we'll need an automatic classification of devices; this is now
the third feature I know of that could use it (raid mirror balancing,
checksum offload, and this one).

There's more to reply to; I'll continue another day.