Message ID | b97e47ce0ce1d41d221878de7d6090b90aa7a597.1695065233.git.josef@toxicpanda.com (mailing list archive) |
---|---|
State | New, archived |
Series | btrfs: adjust overcommit logic when very close to full
On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> A user reported some unpleasant behavior with very small file systems.
> [...]
> Address this by adjusting the overcommit logic. Simply put we need to
> take away 1 chunk from the available chunk space in case of a data
> reservation. This will allow us to stop overcommitting before we
> potentially lose this space to a data allocation. With this fix in
> place we properly allocate a metadata chunk before we're completely
> full, allowing for enough slack space in metadata.

LGTM, this should help and I've been kicking around the same idea in my
head for a while.

I do think this is kind of a band-aid, though. It isn't hard to imagine
that you allocate data chunks up to the 1G, then allocate a metadata
chunk, then fragment/under-utilize the data to the point that you
actually fill up the metadata and get right back to this same point.

Long term, I think we still need more/smarter reclaim, but this should
be a good steam valve for the simple cases where we deterministically
gobble up all the unallocated space for data.

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Reviewed-by: Boris Burkov <boris@bur.io>

> [...]
On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> [...]
> +	 * To handle this simply remove 1G (which is our current maximum chunk
> +	 * allocation size) from the available space. If we are relatively
> +	 * empty this won't affect our ability to overcommit much, and if we're
> +	 * very close to full it'll keep us from getting into a position where
> +	 * we've given ourselves very little metadata wiggle room.
> +	 */
> +	if (avail < SZ_1G)
> +		return 0;
> +	avail -= SZ_1G;

Should the value be derived from the alloc_chunk_ctl::max_chunk_size or
chunk_size? Or at least use a named constant, similar to
BTRFS_MAX_DATA_CHUNK_SIZE.
On Mon, Sep 18, 2023 at 01:14:41PM -0700, Boris Burkov wrote:
> On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > [...]
>
> LGTM, this should help and I've been kicking around the same idea in my
> head for a while.
>
> I do think this is kind of a band-aid, though. It isn't hard to imagine
> that you allocate data chunks up to the 1G, then allocate a metadata
> chunk, then fragment/under-utilize the data to the point that you
> actually fill up the metadata and get right back to this same point.
>
> Long term, I think we still need more/smarter reclaim, but this should
> be a good steam valve for the simple cases where we deterministically
> gobble up all the unallocated space for data.

This is definitely a bit of a band-aid, because we can have any number of
things allocate a chunk at any given time; however this is more of a
concern for small file systems where we only have the initial 8MiB
metadata block group.

We spoke offline, but the long term fix here is to have a chunk
reservation system and use that in overcommit so we never have to worry
about suddenly not being able to allocate a chunk. Then if we need to
revoke that reservation we can force flush everything to get us under the
overcommit threshold, and then disable overcommit because we'll have
allocated that chunk.

For now this fixes the problem with the least surprise. Thanks,

Josef
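The chunk reservation system mentioned here does not exist yet; purely to illustrate the idea, a toy userspace model could look like the sketch below. Every name and number in it is hypothetical, nothing here is btrfs API.

/*
 * Toy model of "reserve a chunk for metadata and exclude it from overcommit".
 * Space held by the reservation can never be handed out for data, so a later
 * metadata chunk allocation cannot be starved. Illustrative only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CHUNK_SZ (1024ULL * 1024 * 1024)	/* 1 GiB, illustrative */

struct fs_model {
	uint64_t unallocated;	/* device space not yet in any chunk */
	uint64_t meta_reserved;	/* held back for a future metadata chunk */
};

/* Reserve one chunk's worth of unallocated space for metadata, if possible. */
static bool reserve_meta_chunk(struct fs_model *fs)
{
	if (fs->unallocated - fs->meta_reserved < CHUNK_SZ)
		return false;
	fs->meta_reserved += CHUNK_SZ;
	return true;
}

/* Space that overcommit (and data chunk allocation) may still consume. */
static uint64_t avail_for_overcommit(const struct fs_model *fs)
{
	return fs->unallocated - fs->meta_reserved;
}

int main(void)
{
	struct fs_model fs = { .unallocated = 8 * CHUNK_SZ, .meta_reserved = 0 };

	if (reserve_meta_chunk(&fs))
		printf("overcommit may use %llu bytes, 1 chunk kept in reserve\n",
		       (unsigned long long)avail_for_overcommit(&fs));
	return 0;
}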
On Mon, Sep 18, 2023 at 11:29:50PM +0200, David Sterba wrote:
> On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > [...]
> > +	if (avail < SZ_1G)
> > +		return 0;
> > +	avail -= SZ_1G;
>
> Should the value be derived from the alloc_chunk_ctl::max_chunk_size or
> chunk_size? Or at least use a named constant, similar to
> BTRFS_MAX_DATA_CHUNK_SIZE.

I originally had this, but we still have this logic in the chunk
allocator

	/* Stripe size should not go beyond 1G. */
	ctl->stripe_size = min_t(u64, ctl->stripe_size, SZ_1G);

max_chunk_size tends to be larger than this, and BTRFS_MAX_DATA_CHUNK_SIZE
is 10G.

We need to go back and clean up the logic to not have the SZ_1G here,
rename it to BTRFS_MAX_CHUNK_SIZE or something like that, and then update
my patch to use this.

I opted for the cleanest fix for now; the cleanups can come after.
Alternatively I can do the cleanups and then the fix, it's up to you.
Thanks,

Josef
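The cleanup sketched in this reply might end up looking roughly like the following hypothetical diff. BTRFS_MAX_CHUNK_SIZE is only the name proposed above (it is not an existing constant), and the file locations of the hunks are approximate; this is not an actual posted patch.

--- a/fs/btrfs/fs.h		(header location is a guess)
+++ b/fs/btrfs/fs.h
+/* Current cap on the size of any single allocated chunk. */
+#define BTRFS_MAX_CHUNK_SIZE	SZ_1G

--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
-	/* Stripe size should not go beyond 1G. */
-	ctl->stripe_size = min_t(u64, ctl->stripe_size, SZ_1G);
+	/* Stripe size should not go beyond the maximum chunk size. */
+	ctl->stripe_size = min_t(u64, ctl->stripe_size, BTRFS_MAX_CHUNK_SIZE);

--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
-	if (avail < SZ_1G)
+	if (avail < BTRFS_MAX_CHUNK_SIZE)
 		return 0;
-	avail -= SZ_1G;
+	avail -= BTRFS_MAX_CHUNK_SIZE;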
On Wed, Sep 20, 2023 at 10:01:47AM -0400, Josef Bacik wrote:
> On Mon, Sep 18, 2023 at 11:29:50PM +0200, David Sterba wrote:
> > [...]
> > Should the value be derived from the alloc_chunk_ctl::max_chunk_size or
> > chunk_size? Or at least use a named constant, similar to
> > BTRFS_MAX_DATA_CHUNK_SIZE.
>
> [...]
> We need to go back and clean up the logic to not have the SZ_1G here,
> rename it to BTRFS_MAX_CHUNK_SIZE or something like that, and then update
> my patch to use this.
>
> I opted for the cleanest fix for now; the cleanups can come after.
> Alternatively I can do the cleanups and then the fix, it's up to you.

For a quick fix it's ok, at least it's commented and not just a random
magic constant. The cleanups can be done later. Thanks.
On Wed, Sep 20, 2023 at 09:59:23AM -0400, Josef Bacik wrote:
> On Mon, Sep 18, 2023 at 01:14:41PM -0700, Boris Burkov wrote:
> > [...]
> > Long term, I think we still need more/smarter reclaim, but this should
> > be a good steam valve for the simple cases where we deterministically
> > gobble up all the unallocated space for data.
>
> This is definitely a bit of a band-aid, because we can have any number of
> things allocate a chunk at any given time; however this is more of a
> concern for small file systems where we only have the initial 8MiB
> metadata block group.

Is it really 8M for metadata? The default mkfs creates something like this:

  Filesystem size:    4.00GiB
  Block group profiles:
    Data:             single            8.00MiB
    Metadata:         DUP             256.00MiB
    System:           DUP               8.00MiB
On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> A user reported some unpleasant behavior with very small file systems.
> [...]
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

Added to misc-next, thanks.
On Wed, Sep 20, 2023 at 11:44 PM David Sterba <dsterba@suse.cz> wrote:
> On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > [...]
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
>
> Added to misc-next, thanks.

So after this change, at least 2 test cases from fstests are failing
with -ENOSPC on misc-next:

$ ./check btrfs/156 btrfs/177
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian0 6.6.0-rc2-btrfs-next-138+ #1 SMP PREEMPT_DYNAMIC Thu Sep 21 17:58:48 WEST 2023
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/156 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/156.out.bad)
    --- tests/btrfs/156.out	2020-06-10 19:29:03.818519162 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/156.out.bad	2023-09-21 23:41:49.911844058 +0100
    @@ -1,2 +1,6 @@
     QA output created by 156
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No space left on device
    +There may be more info in syslog - try dmesg | tail
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No space left on device
    +There may be more info in syslog - try dmesg | tail
     Silence is golden
    ...
    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/156.out /home/fdmanana/git/hub/xfstests/results//btrfs/156.out.bad' to see the entire diff)
btrfs/177 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/177.out.bad)
    --- tests/btrfs/177.out	2023-03-02 21:47:53.872609330 +0000
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/177.out.bad	2023-09-21 23:41:57.200117690 +0100
    @@ -1,4 +1,8 @@
     QA output created by 177
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No space left on device
    +There may be more info in syslog - try dmesg | tail
     Resized to 3221225472
    +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No space left on device
    +There may be more info in syslog - try dmesg | tail
     Text file busy
    ...
    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/177.out /home/fdmanana/git/hub/xfstests/results//btrfs/177.out.bad' to see the entire diff)
Ran: btrfs/156 btrfs/177
Failures: btrfs/156 btrfs/177
Failed 2 of 2 tests

The tests pass again after reverting this change.

Thanks.
On Thu, Sep 21, 2023 at 11:50:24PM +0100, Filipe Manana wrote:
> On Wed, Sep 20, 2023 at 11:44 PM David Sterba <dsterba@suse.cz> wrote:
> > On Mon, Sep 18, 2023 at 03:27:47PM -0400, Josef Bacik wrote:
> > > [...]
> >
> > Added to misc-next, thanks.
>
> So after this change, at least 2 test cases from fstests are failing
> with -ENOSPC on misc-next:
>
> $ ./check btrfs/156 btrfs/177

Thanks, I've removed the patch from misc-next for now.
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index d7e8cd4f140c..7aa53058d893 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -365,6 +365,23 @@ static u64 calc_available_free_space(struct btrfs_fs_info *fs_info,
 	factor = btrfs_bg_type_to_factor(profile);
 	avail = div_u64(avail, factor);
 
+	/*
+	 * Since data allocations immediately use block groups as part of the
+	 * reservation, because we assume that data reservations will == actual
+	 * usage, we could potentially overcommit and then immediately have that
+	 * available space used by a data allocation, which could put us in a
+	 * bind when we get close to filling the file system.
+	 *
+	 * To handle this simply remove 1G (which is our current maximum chunk
+	 * allocation size) from the available space. If we are relatively
+	 * empty this won't affect our ability to overcommit much, and if we're
+	 * very close to full it'll keep us from getting into a position where
+	 * we've given ourselves very little metadata wiggle room.
+	 */
+	if (avail < SZ_1G)
+		return 0;
+	avail -= SZ_1G;
+
 	/*
 	 * If we aren't flushing all things, let us overcommit up to
 	 * 1/2th of the space. If we can flush, don't let us overcommit
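For context when reading the hunk above: calc_available_free_space() feeds the metadata overcommit check in the same file. A simplified paraphrase of that consumer is shown below; it is not the literal kernel code (the exact name, signature, and helpers vary across versions), only a sketch of the logic the patch is shaping.

/* Simplified paraphrase of the overcommit check in fs/btrfs/space-info.c. */
static bool can_overcommit_sketch(struct btrfs_fs_info *fs_info,
				  struct btrfs_space_info *space_info,
				  u64 bytes, enum btrfs_reserve_flush_enum flush)
{
	u64 used, avail;

	/* Data reservations never overcommit; only metadata does. */
	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
		return false;

	used = btrfs_space_info_used(space_info, true);
	avail = calc_available_free_space(fs_info, space_info, flush);

	/*
	 * With the hunk above, 'avail' is reduced by one max-sized chunk, so
	 * this comparison starts failing (and flushing/chunk allocation kicks
	 * in) while there is still room to allocate a metadata chunk.
	 */
	return used + bytes < space_info->total_bytes + avail;
}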
A user reported some unpleasant behavior with very small file systems.
The reproducer is this

  mkfs.btrfs -f -m single -b 8g /dev/vdb
  mount /dev/vdb /mnt/test
  dd if=/dev/zero of=/mnt/test/testfile bs=512M count=20

This will result in usage that looks like this

  Overall:
      Device size:                   8.00GiB
      Device allocated:              8.00GiB
      Device unallocated:            1.00MiB
      Device missing:                  0.00B
      Device slack:                  2.00GiB
      Used:                          5.47GiB
      Free (estimated):              2.52GiB      (min: 2.52GiB)
      Free (statfs, df):               0.00B
      Data ratio:                       1.00
      Metadata ratio:                   1.00
      Global reserve:                5.50MiB      (used: 0.00B)
      Multiple profiles:                  no

  Data,single: Size:7.99GiB, Used:5.46GiB (68.41%)
     /dev/vdb        7.99GiB

  Metadata,single: Size:8.00MiB, Used:5.77MiB (72.07%)
     /dev/vdb        8.00MiB

  System,single: Size:4.00MiB, Used:16.00KiB (0.39%)
     /dev/vdb        4.00MiB

  Unallocated:
     /dev/vdb        1.00MiB

As you can see we've gotten ourselves quite full with metadata, with all
of the disk being allocated for data.

On smaller file systems there's not a lot of time before we get full, so
our overcommit behavior bites us here. Generally speaking data
reservations result in chunk allocations as we assume reservation ==
actual use for data. This means at any point we could end up with a
chunk allocation for data, and if we're very close to full we could do
this before we have a chance to figure out that we need another metadata
chunk.

Address this by adjusting the overcommit logic. Simply put we need to
take away 1 chunk from the available chunk space in case of a data
reservation. This will allow us to stop overcommitting before we
potentially lose this space to a data allocation. With this fix in
place we properly allocate a metadata chunk before we're completely
full, allowing for enough slack space in metadata.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/space-info.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)
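To make the effect concrete for the 8 GiB reproducer, here is a small standalone model of just the new clamp. The numbers are illustrative only; the real function also divides by the RAID stripe factor before this check and scales the result by a flush-dependent fraction afterwards.

/* Back-of-the-envelope model of the new 1 GiB hold-back (illustrative only). */
#include <stdint.h>
#include <stdio.h>

#define SZ_1G (1024ULL * 1024 * 1024)

static uint64_t adjusted_avail(uint64_t unallocated)
{
	/* Mirrors the hunk above: hold one max-sized chunk back from overcommit. */
	if (unallocated < SZ_1G)
		return 0;
	return unallocated - SZ_1G;
}

int main(void)
{
	/* Early on: plenty of unallocated space, overcommit barely affected. */
	printf("7 GiB unallocated -> %llu bytes usable for overcommit\n",
	       (unsigned long long)adjusted_avail(7 * SZ_1G));
	/*
	 * Near full: less than one chunk left, so overcommit stops entirely,
	 * forcing metadata flushing/chunk allocation before data eats the rest.
	 */
	printf("1 MiB unallocated -> %llu bytes usable for overcommit\n",
	       (unsigned long long)adjusted_avail(1024 * 1024));
	return 0;
}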