
btrfs: adjust overcommit logic when very close to full

Message ID b97e47ce0ce1d41d221878de7d6090b90aa7a597.1695065233.git.josef@toxicpanda.com (mailing list archive)
State New, archived
Series btrfs: adjust overcommit logic when very close to full

Commit Message

Josef Bacik Sept. 18, 2023, 7:27 p.m. UTC
A user reported some unpleasant behavior with very small file systems.
The reproducer is this:

mkfs.btrfs -f -m single -b 8g /dev/vdb
mount /dev/vdb /mnt/test
dd if=/dev/zero of=/mnt/test/testfile bs=512M count=20

This will result in usage that looks like this:

Overall:
    Device size:                   8.00GiB
    Device allocated:              8.00GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Device slack:                  2.00GiB
    Used:                          5.47GiB
    Free (estimated):              2.52GiB      (min: 2.52GiB)
    Free (statfs, df):               0.00B
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:                5.50MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:7.99GiB, Used:5.46GiB (68.41%)
   /dev/vdb        7.99GiB

Metadata,single: Size:8.00MiB, Used:5.77MiB (72.07%)
   /dev/vdb        8.00MiB

System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
   /dev/vdb        4.00MiB

Unallocated:
   /dev/vdb        1.00MiB

As you can see, we've gotten ourselves quite full on metadata, with all
of the disk allocated to data.

On smaller file systems there's not a lot of time before we get full, so
our overcommit behavior bites us here.  Generally speaking, data
reservations result in chunk allocations, as we assume reservation ==
actual use for data.  This means that at any point we could end up with a
chunk allocation for data, and if we're very close to full we could do
this before we have a chance to figure out that we need another metadata
chunk.

Address this by adjusting the overcommit logic.  Simply put, we need to
take away 1 chunk from the available chunk space in case of a data
reservation.  This will allow us to stop overcommitting before we
potentially lose this space to a data allocation.  With this fix in
place we properly allocate a metadata chunk before we're completely
full, allowing for enough slack space in metadata.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

Comments

Boris Burkov Sept. 18, 2023, 8:14 p.m. UTC | #1
On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> [...]

LGTM, this should help and I've been kicking around the same idea in my
head for a while.

I do think this is kind of a band-aid, though. It isn't hard to imagine
that you allocate data chunks up to the 1G, then allocate a metadata
chunk, then fragment/under-utilize the data to the point that you
actually fill up the metadata and get right back to this same point.

Long term, I think we still need more/smarter reclaim, but this should
be a good steam valve for the simple cases where we deterministically
gobble up all the unallocated space for data.

> 
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Boris Burkov <boris@bur.io>

David Sterba Sept. 18, 2023, 9:29 p.m. UTC | #2
On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> [...]
> +	/*
> +	 * Since data allocations immediately use block groups as part of the
> +	 * reservation, because we assume that data reservations will == actual
> +	 * usage, we could potentially overcommit and then immediately have that
> +	 * available space used by a data allocation, which could put us in a
> +	 * bind when we get close to filling the file system.
> +	 *
> +	 * To handle this simply remove 1G (which is our current maximum chunk
> +	 * allocation size) from the available space.  If we are relatively
> +	 * empty this won't affect our ability to overcommit much, and if we're
> +	 * very close to full it'll keep us from getting into a position where
> +	 * we've given ourselves very little metadata wiggle room.
> +	 */
> +	if (avail < SZ_1G)
> +		return 0;
> +	avail -= SZ_1G;

Should the value be derived from the alloc_chunk_ctl::max_chunk_size or
chunk_size? Or at least use a named constant, similar to
BTRFS_MAX_DATA_CHUNK_SIZE.

Josef Bacik Sept. 20, 2023, 1:59 p.m. UTC | #3
On Mon, Sep 18, 2023 at 01:14:41PM -0700, Boris Burkov wrote:
> On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > [...]
> 
> LGTM, this should help and I've been kicking around the same idea in my
> head for a while.
> 
> I do think this is kind of a band-aid, though. It isn't hard to imagine
> that you allocate data chunks up to the 1G, then allocate a metadata
> chunk, then fragment/under-utilize the data to the point that you
> actually fill up the metadata and get right back to this same point.
> 
> Long term, I think we still need more/smarter reclaim, but this should
> be a good steam valve for the simple cases where we deterministically
> gobble up all the unallocated space for data.

This is definitely a bit of a bandaid, because we can have any number of things
allocate a chunk at any given time.  However, this is more of a concern for small
file systems where we only have the initial 8MiB metadata block group.

We spoke offline, but the long-term fix here is to have a chunk reservation
system and use that in overcommit so we never have to worry about suddenly not
being able to allocate a chunk.  Then, if we need to revoke that reservation, we
can force flush everything to get us under the overcommit threshold, and then
disable overcommit because we'll have allocated that chunk.

For now this fixes the problem with the least surprise.  Thanks,

Josef
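
As a purely illustrative aside (not kernel code, and none of these names exist in
btrfs), a minimal standalone sketch of the chunk-reservation idea described above
could look like this, with overcommit calculated only against space that has not
already been promised to a future metadata chunk:

/* Hypothetical toy model of the proposed chunk reservation, not btrfs code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_CHUNK_SIZE (1ULL << 30)	/* mirrors the current 1G chunk cap */

struct toy_space_info {
	uint64_t unallocated;		/* bytes not yet in any chunk */
	uint64_t reserved_chunk_bytes;	/* promised to future metadata chunks */
};

/* Promise one chunk's worth of unallocated space to metadata ahead of time. */
static bool toy_reserve_chunk(struct toy_space_info *s)
{
	if (s->unallocated < TOY_CHUNK_SIZE)
		return false;
	s->unallocated -= TOY_CHUNK_SIZE;
	s->reserved_chunk_bytes += TOY_CHUNK_SIZE;
	return true;
}

/* Overcommit only against space that nothing else has already claimed. */
static uint64_t toy_available_for_overcommit(const struct toy_space_info *s)
{
	return s->unallocated;
}

int main(void)
{
	struct toy_space_info s = { .unallocated = 4 * TOY_CHUNK_SIZE };

	toy_reserve_chunk(&s);	/* chunk promised; revoking it would mean flushing first */
	printf("can overcommit against %llu bytes\n",
	       (unsigned long long)toy_available_for_overcommit(&s));
	return 0;
}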
Josef Bacik Sept. 20, 2023, 2:01 p.m. UTC | #4
On Mon, Sep 18, 2023 at 11:29:50PM +0200, David Sterba wrote:
> On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > [...]
> > +	 * To handle this simply remove 1G (which is our current maximum chunk
> > +	 * allocation size) from the available space.  If we are relatively
> > +	 * empty this won't affect our ability to overcommit much, and if we're
> > +	 * very close to full it'll keep us from getting into a position where
> > +	 * we've given ourselves very little metadata wiggle room.
> > +	 */
> > +	if (avail < SZ_1G)
> > +		return 0;
> > +	avail -= SZ_1G;
> 
> Should the value be derived from the alloc_chunk_ctl::max_chunk_size or
> chunk_size? Or at least use a named constant, similar to
> BTRFS_MAX_DATA_CHUNK_SIZE .

I originally had this, but we still have this logic in the chunk allocator:

	/* Stripe size should not go beyond 1G. */
	ctl->stripe_size = min_t(u64, ctl->stripe_size, SZ_1G);

max_chunk_size tends to be larger than this; BTRFS_MAX_DATA_CHUNK_SIZE is 10G.

We need to go back and clean up the logic to not have the SZ_1G here, rename it
to BTRFS_MAX_CHUNK_SIZE or something like that, and then update my patch to use
this.

I opted for the cleanest fix for now; the cleanups can come after.
Alternatively, I can do the cleanups and then the fix; it's up to you.  Thanks,

Josef
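
As a rough, untested sketch of the cleanup being discussed (the constant name below
is only the placeholder Josef suggests; it does not exist in the tree today), the
chunk allocator clamp and the overcommit adjustment could share one named constant:

/* Hypothetical name, following the suggestion above; not an existing constant. */
#define BTRFS_MAX_CHUNK_SIZE	SZ_1G

	/* In the chunk allocator: stripe size should not go beyond the max chunk size. */
	ctl->stripe_size = min_t(u64, ctl->stripe_size, BTRFS_MAX_CHUNK_SIZE);

	/* In calc_available_free_space(): set aside one maximum-sized chunk. */
	if (avail < BTRFS_MAX_CHUNK_SIZE)
		return 0;
	avail -= BTRFS_MAX_CHUNK_SIZE;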
David Sterba Sept. 20, 2023, 2:04 p.m. UTC | #5
On Wed, Sep 20, 2023 at 10:01:47AM -0400, Josef Bacik wrote:
> On Mon, Sep 18, 2023 at 11:29:50PM +0200, David Sterba wrote:
> > On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > > [...]
> > 
> > Should the value be derived from the alloc_chunk_ctl::max_chunk_size or
> > chunk_size? Or at least use a named constant, similar to
> > BTRFS_MAX_DATA_CHUNK_SIZE .
> 
> I originally had this, but we still have this logic in the chunk allocator
> 
> 	/* Stripe size should not go beyond 1G. */
> 	ctl->stripe_size = min_t(u64, ctl->stripe_size, SZ_1G);
> 
> max_chunk_size tends to be larger than this, BTRFS_MAX_DATA_CHUNK_SIZE is 10G.
> 
> We need to go back and clean up the logic to not have the SZ_1G here, rename it
> to BTRFS_MAX_CHUNK_SIZE or something like that, and then update my patch to use
> this.
> 
> I opted for the cleanest fix for now, the cleanups can come after.
> Alternatively I can do the cleanups and do the fix, it's up to you.  Thanks,

For a quick fix it's OK; at least it's commented and not just a random
magic constant. The cleanups can be done later. Thanks.
David Sterba Sept. 20, 2023, 7:02 p.m. UTC | #6
On Wed, Sep 20, 2023 at 09:59:23AM -0400, Josef Bacik wrote:
> On Mon, Sep 18, 2023 at 01:14:41PM -0700, Boris Burkov wrote:
> > On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > [...]
> 
> This is definitely a bit of a bandaid, because we can have any number of things
> allocate a chunk at any given time.  However, this is more of a concern for small
> file systems where we only have the initial 8MiB metadata block group.

Is it really 8M for metadata? The default mkfs creates something like this:

Filesystem size:    4.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP             256.00MiB
  System:           DUP               8.00MiB
David Sterba Sept. 20, 2023, 7:05 p.m. UTC | #7
On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> [...]

Added to misc-next, thanks.
Filipe Manana Sept. 21, 2023, 10:50 p.m. UTC | #8
On Wed, Sep 20, 2023 at 11:44 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > [...]
>
> Added to misc-next, thanks.

So after this change, at least 2 test cases from fstests are failing
with -ENOSPC on misc-next:

$ ./check btrfs/156 btrfs/177
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian0 6.6.0-rc2-btrfs-next-138+ #1 SMP PREEMPT_DYNAMIC Thu Sep 21 17:58:48 WEST 2023
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/156 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/156.out.bad)
    --- tests/btrfs/156.out 2020-06-10 19:29:03.818519162 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/156.out.bad 2023-09-21 23:41:49.911844058 +0100
    @@ -1,2 +1,6 @@
     QA output created by 156
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No space left on device
    +There may be more info in syslog - try dmesg | tail
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No space left on device
    +There may be more info in syslog - try dmesg | tail
     Silence is golden
    ...
    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/156.out /home/fdmanana/git/hub/xfstests/results//btrfs/156.out.bad'  to see the entire diff)
btrfs/177 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/177.out.bad)
    --- tests/btrfs/177.out 2023-03-02 21:47:53.872609330 +0000
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/177.out.bad 2023-09-21 23:41:57.200117690 +0100
    @@ -1,4 +1,8 @@
     QA output created by 177
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No space left on device
    +There may be more info in syslog - try dmesg | tail
     Resized to 3221225472
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No space left on device
    +There may be more info in syslog - try dmesg | tail
     Text file busy
    ...
    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/177.out /home/fdmanana/git/hub/xfstests/results//btrfs/177.out.bad'  to see the entire diff)
Ran: btrfs/156 btrfs/177
Failures: btrfs/156 btrfs/177
Failed 2 of 2 tests

The tests pass again after reverting this change.

Thanks.
David Sterba Sept. 22, 2023, 10:25 a.m. UTC | #9
On Thu, Sep 21, 2023 at 11:50:24PM +0100, Filipe Manana wrote:
> On Wed, Sep 20, 2023 at 11:44 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> >
> > Added to misc-next, thanks.
> 
> So after this change, at least 2 test cases from fstests are failing
> with -ENOSPC on misc-next:
> 
> $ ./check btrfs/156 btrfs/177

Thanks, I've removed the patch from misc-next for now.

Patch

diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index d7e8cd4f140c..7aa53058d893 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -365,6 +365,23 @@  static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
 	factor = btrfs_bg_type_to_factor(profile);
 	avail = div_u64(avail, factor);
 
+	/*
+	 * Since data allocations immediately use block groups as part of the
+	 * reservation, because we assume that data reservations will == actual
+	 * usage, we could potentially overcommit and then immediately have that
+	 * available space used by a data allocation, which could put us in a
+	 * bind when we get close to filling the file system.
+	 *
+	 * To handle this simply remove 1G (which is our current maximum chunk
+	 * allocation size) from the available space.  If we are relatively
+	 * empty this won't affect our ability to overcommit much, and if we're
+	 * very close to full it'll keep us from getting into a position where
+	 * we've given ourselves very little metadata wiggle room.
+	 */
+	if (avail < SZ_1G)
+		return 0;
+	avail -= SZ_1G;
+
 	/*
 	 * If we aren't flushing all things, let us overcommit up to
 	 * 1/2th of the space. If we can flush, don't let us overcommit