
[1/5] btrfs: 1G falloc extents

Message ID cace4a8be466b9c4fee288c768c5384988c1fca8.1664999303.git.boris@bur.io (mailing list archive)
State New, archived
Series btrfs: data block group size classes

Commit Message

Boris Burkov Oct. 5, 2022, 7:49 p.m. UTC
When doing a large fallocate, btrfs will break it up into 256MiB
extents. Our data block groups are 1GiB, so a more natural maximum size
is 1GiB, so that we tend to allocate and fully use block groups rather
than fragmenting the file around.

This is especially useful if large fallocates tend to be for "round"
amounts, which strikes me as a reasonable assumption.

While moving to size classes reduces the value of this change, it is
also good to compare potential allocator algorithms against just 1G
extents.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
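
For readers skimming the thread, here is a rough userspace sketch of the
chunking behavior the one-liner changes (illustrative only; the real
__btrfs_prealloc_file_range() also handles reservation, transactions and
failure, and the helper names below are made up):

#include <stdint.h>
#include <stdio.h>

#define SZ_256M (256ULL * 1024 * 1024)
#define SZ_1G   (1024ULL * 1024 * 1024)

/* Carve a preallocation request into extents no larger than 'cap'. */
static unsigned int carve(uint64_t num_bytes, uint64_t min_size, uint64_t cap)
{
	unsigned int extents = 0;

	while (num_bytes > 0) {
		uint64_t cur_bytes = num_bytes < cap ? num_bytes : cap;

		if (cur_bytes < min_size)
			cur_bytes = min_size;
		extents++;
		num_bytes -= cur_bytes < num_bytes ? cur_bytes : num_bytes;
	}
	return extents;
}

int main(void)
{
	uint64_t req = 2 * SZ_1G;	/* the 2GiB example discussed later in the thread */

	printf("256MiB cap: %u extents\n", carve(req, 0, SZ_256M)); /* 8 */
	printf("1GiB   cap: %u extents\n", carve(req, 0, SZ_1G));   /* 2 */
	return 0;
}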

Comments

Qu Wenruo Oct. 6, 2022, 7:37 a.m. UTC | #1
On 2022/10/6 03:49, Boris Burkov wrote:
> When doing a large fallocate, btrfs will break it up into 256MiB
> extents. Our data block groups are 1GiB, so a more natural maximum size
> is 1GiB, so that we tend to allocate and fully use block groups rather
> than fragmenting the file around.
>
> This is especially useful if large fallocates tend to be for "round"
> amounts, which strikes me as a reasonable assumption.
>
> While moving to size classes reduces the value of this change, it is
> also good to compare potential allocator algorithms against just 1G
> extents.

Btrfs extent booking is already causing a lot of wasted space; is this
larger extent size really a good idea?

E.g. after a lot of random writes, we may have only a very small part of
the original 1G still being referenced.
(The first write into the preallocated range will not be COWed, but the
next one over the same range will be COWed.)

But the full 1G can only be freed if none of its sectors is still
referenced. Thus this would make preallocated space much harder to free,
and snapshots/reflinks can make it even worse.

So wouldn't such an enlarged preallocated extent size cause more pressure?

In fact, the original 256M already seems too large to me.

Thanks,
Qu
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>   fs/btrfs/inode.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 45ebef8d3ea8..fd66586ae2fc 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>   	if (trans)
>   		own_trans = false;
>   	while (num_bytes > 0) {
> -		cur_bytes = min_t(u64, num_bytes, SZ_256M);
> +		cur_bytes = min_t(u64, num_bytes, SZ_1G);
>   		cur_bytes = max(cur_bytes, min_size);
>   		/*
>   		 * If we are severely fragmented we could end up with really
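
To put rough numbers on the worst case described above: a single still
referenced 4KiB sector pins the whole on-disk extent, so the maximum
waste per extent grows with the cap. A small illustrative calculation
(editorial, not from the thread; assumes 4KiB sectors):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/*
	 * Bookend worst case: one 4KiB sector of a preallocated extent is
	 * still referenced (e.g. after partial COW overwrites, or because a
	 * snapshot/reflink still points at it), so the entire extent stays
	 * allocated on disk.
	 */
	const uint64_t sector = 4096;
	const uint64_t caps[] = { 128ULL << 20, 256ULL << 20, 1ULL << 30 };

	for (unsigned int i = 0; i < sizeof(caps) / sizeof(caps[0]); i++)
		printf("%4llu MiB extent: up to %llu MiB kept alive by one sector\n",
		       (unsigned long long)(caps[i] >> 20),
		       (unsigned long long)((caps[i] - sector) >> 20));
	return 0;
}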
Johannes Thumshirn Oct. 6, 2022, 8:48 a.m. UTC | #2
On 05.10.22 21:49, Boris Burkov wrote:
> When doing a large fallocate, btrfs will break it up into 256MiB
> extents. Our data block groups are 1GiB, so a more natural maximum size
> is 1GiB, so that we tend to allocate and fully use block groups rather
> than fragmenting the file around.
> 
> This is especially useful if large fallocates tend to be for "round"
> amounts, which strikes me as a reasonable assumption.
> 
> While moving to size classes reduces the value of this change, it is
> also good to compare potential allocator algorithms against just 1G
> extents.
> 
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 45ebef8d3ea8..fd66586ae2fc 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>  	if (trans)
>  		own_trans = false;
>  	while (num_bytes > 0) {
> -		cur_bytes = min_t(u64, num_bytes, SZ_256M);
> +		cur_bytes = min_t(u64, num_bytes, SZ_1G);
>  		cur_bytes = max(cur_bytes, min_size);
>  		/*
>  		 * If we are severely fragmented we could end up with really

Shouldn't this be space_info::chunk_size to be future proof?
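
In concrete terms, that suggestion might look roughly like the snippet
below (not compile-tested; it assumes fs_info->data_sinfo is valid at
this point and that space_info::chunk_size is the value meant here, both
of which would need checking):

	u64 max_extent = READ_ONCE(fs_info->data_sinfo->chunk_size);

	while (num_bytes > 0) {
		cur_bytes = min_t(u64, num_bytes, max_extent);
		cur_bytes = max(cur_bytes, min_size);
		/* ... rest of the loop in __btrfs_prealloc_file_range() ... */
	}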
Filipe Manana Oct. 6, 2022, 9:48 a.m. UTC | #3
On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/10/6 03:49, Boris Burkov wrote:
> > When doing a large fallocate, btrfs will break it up into 256MiB
> > extents. Our data block groups are 1GiB, so a more natural maximum size
> > is 1GiB, so that we tend to allocate and fully use block groups rather
> > than fragmenting the file around.
> >
> > This is especially useful if large fallocates tend to be for "round"
> > amounts, which strikes me as a reasonable assumption.
> >
> > While moving to size classes reduces the value of this change, it is
> > also good to compare potential allocator algorithms against just 1G
> > extents.
>
> Btrfs extent booking is already causing a lot of wasted space, is this
> larger extent size really a good idea?
>
> E.g. after a lot of random writes, we may have only a very small part of
> the original 1G still being referred.
> (The first write into the pre-allocated range will not be COWed, but the
> next one over the same range will be COWed)
>
> But the full 1G can only be freed if none of its sectors is referred.
> Thus this would make preallocated space much harder to be free,
> snapshots/reflink can make it even worse.
>
> So wouldn't such enlarged preallocted extent size cause more pressure?

I agree, increasing the max extent size here does not seem like a good
thing to do.

If an application fallocates space, then it generally expects to write to all
that space. However future random partial writes may not rewrite the entire
extent for a very long time, therefore making us keep a 1G extent for a very
long time (or forever in the worst case).

Even for NOCOW files, it's still an issue if snapshots are used.

>
> In fact, the original 256M is already too large to me.
>
> Thanks,
> Qu
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >   fs/btrfs/inode.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 45ebef8d3ea8..fd66586ae2fc 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
> >       if (trans)
> >               own_trans = false;
> >       while (num_bytes > 0) {
> > -             cur_bytes = min_t(u64, num_bytes, SZ_256M);
> > +             cur_bytes = min_t(u64, num_bytes, SZ_1G);
> >               cur_bytes = max(cur_bytes, min_size);
> >               /*
> >                * If we are severely fragmented we could end up with really
Boris Burkov Oct. 6, 2022, 6:38 p.m. UTC | #4
On Thu, Oct 06, 2022 at 10:48:33AM +0100, Filipe Manana wrote:
> On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> >
> >
> > On 2022/10/6 03:49, Boris Burkov wrote:
> > > When doing a large fallocate, btrfs will break it up into 256MiB
> > > extents. Our data block groups are 1GiB, so a more natural maximum size
> > > is 1GiB, so that we tend to allocate and fully use block groups rather
> > > than fragmenting the file around.
> > >
> > > This is especially useful if large fallocates tend to be for "round"
> > > amounts, which strikes me as a reasonable assumption.
> > >
> > > While moving to size classes reduces the value of this change, it is
> > > also good to compare potential allocator algorithms against just 1G
> > > extents.
> >
> > Btrfs extent booking is already causing a lot of wasted space, is this
> > larger extent size really a good idea?
> >
> > E.g. after a lot of random writes, we may have only a very small part of
> > the original 1G still being referred.
> > (The first write into the pre-allocated range will not be COWed, but the
> > next one over the same range will be COWed)
> >
> > But the full 1G can only be freed if none of its sectors is referred.
> > Thus this would make preallocated space much harder to be free,
> > snapshots/reflink can make it even worse.
> >
> > So wouldn't such enlarged preallocted extent size cause more pressure?
> 
> I agree, increasing the max extent size here does not seem like a good
> thing to do.
> 
> If an application fallocates space, then it generally expects to write to all
> that space. However future random partial writes may not rewrite the entire
> extent for a very long time, therefore making us keep a 1G extent for a very
> long time (or forever in the worst case).
> 
> Even for NOCOW files, it's still an issue if snapshots are used.
> 

I see your point, and agree 1GiB is worse with respect to bookend
extents. Since the patchset doesn't really rely on this, I don't mind
dropping the change. I was mostly trying to rule this out as a trivial
fix that would obviate the need for other changes.

However, I'm not completely convinced by the argument, for two reasons.

The first is essentially Qu's last comment. If you guys are right, then
256MiB is probably a really bad value for this as well, and we should be
reaping the wins of making it smaller.

The second is that I'm not convinced of how the regression is going to
happen here in practice. Let's say someone does a 2GiB falloc and writes
the file out once. In the old code that will be 8 256MiB extents, in the
new code, 2 1GiB extents. Then, to have this be a regression, the user
would have to fully overwrite one of the 256MiB extents, but not 1GiB.
Are there a lot of workloads that don't use nocow, and which randomly
overwrite all of a 256MiB extent of a larger file? Maybe..

Since I'm making the change, it's incumbent on me to prove it's safe, so
with that in mind, I would reiterate I'm fine to drop it.

> >
> > In fact, the original 256M is already too large to me.
> >
> > Thanks,
> > Qu
> > >
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > >   fs/btrfs/inode.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > index 45ebef8d3ea8..fd66586ae2fc 100644
> > > --- a/fs/btrfs/inode.c
> > > +++ b/fs/btrfs/inode.c
> > > @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
> > >       if (trans)
> > >               own_trans = false;
> > >       while (num_bytes > 0) {
> > > -             cur_bytes = min_t(u64, num_bytes, SZ_256M);
> > > +             cur_bytes = min_t(u64, num_bytes, SZ_1G);
> > >               cur_bytes = max(cur_bytes, min_size);
> > >               /*
> > >                * If we are severely fragmented we could end up with really
Filipe Manana Oct. 6, 2022, 7:56 p.m. UTC | #5
On Thu, Oct 6, 2022 at 7:38 PM Boris Burkov <boris@bur.io> wrote:
>
> On Thu, Oct 06, 2022 at 10:48:33AM +0100, Filipe Manana wrote:
> > On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > >
> > >
> > >
> > > On 2022/10/6 03:49, Boris Burkov wrote:
> > > > When doing a large fallocate, btrfs will break it up into 256MiB
> > > > extents. Our data block groups are 1GiB, so a more natural maximum size
> > > > is 1GiB, so that we tend to allocate and fully use block groups rather
> > > > than fragmenting the file around.
> > > >
> > > > This is especially useful if large fallocates tend to be for "round"
> > > > amounts, which strikes me as a reasonable assumption.
> > > >
> > > > While moving to size classes reduces the value of this change, it is
> > > > also good to compare potential allocator algorithms against just 1G
> > > > extents.
> > >
> > > Btrfs extent booking is already causing a lot of wasted space, is this
> > > larger extent size really a good idea?
> > >
> > > E.g. after a lot of random writes, we may have only a very small part of
> > > the original 1G still being referred.
> > > (The first write into the pre-allocated range will not be COWed, but the
> > > next one over the same range will be COWed)
> > >
> > > But the full 1G can only be freed if none of its sectors is referred.
> > > Thus this would make preallocated space much harder to be free,
> > > snapshots/reflink can make it even worse.
> > >
> > > So wouldn't such enlarged preallocted extent size cause more pressure?
> >
> > I agree, increasing the max extent size here does not seem like a good
> > thing to do.
> >
> > If an application fallocates space, then it generally expects to write to all
> > that space. However future random partial writes may not rewrite the entire
> > extent for a very long time, therefore making us keep a 1G extent for a very
> > long time (or forever in the worst case).
> >
> > Even for NOCOW files, it's still an issue if snapshots are used.
> >
>
> I see your point, and agree 1GiB is worse with respect to bookend
> extents. Since the patchset doesn't really rely on this, I don't mind
> dropping the change. I was mostly trying to rule this out as a trivial
> fix that would obviate the need for other changes.
>
> However, I'm not completely convinced by the argument, for two reasons.
>
> The first is essentially Qu's last comment. If you guys are right, then
> 256MiB is probably a really bad value for this as well, and we should be
> reaping the wins of making it smaller.

Well, that's not a reason to increase the size, quite the opposite.

>
> The second is that I'm not convinced of how the regression is going to
> happen here in practice.

Over the years, every now and then there are users reporting that their
free space is mysteriously going away and they have no clue why, even
when not using snapshots. More experienced users provide help and
eventually notice it's caused by many bookend extents, and the given
workaround is to defrag the files.

You can frequently see such reports in threads on this mailing list or
in the IRC channel.

> Let's say someone does a 2GiB falloc and writes
> the file out once. In the old code that will be 8 256MiB extents, in the
> new code, 2 1GiB extents. Then, to have this be a regression, the user
> would have to fully overwrite one of the 256MiB extents, but not 1GiB.
> Are there a lot of workloads that don't use nocow, and which randomly
> overwrite all of a 256MiB extent of a larger file? Maybe..

Assuming that all or most workloads that use fallocate also set nocow
on the files is quite a big stretch IMO. It's true that for performance
critical applications like traditional relational databases, people
usually set nocow, or the application or some other software does it
automatically for them.

But there are also many applications that use fallocate which people
are not aware of, and neither they nor anything else sets nocow on the
files used by them. Especially if those applications are not IO
intensive, they may go unnoticed, and space wasted due to bookend
extents is therefore more likely to happen.

Having a bunch of very small extents, say 4K to 512K, is clearly bad
compared to having just a few 256M extents. But having a few 1G extents
instead of a group of 256M extents probably doesn't make such a huge
difference as in the former case.

Do the 1G extents benefit some Facebook-specific workload, or some well
known workload?


>
> Since I'm making the change, it's incumbent on me to prove it's safe, so
> with that in mind, I would reiterate I'm fine to drop it.
>
> > >
> > > In fact, the original 256M is already too large to me.
> > >
> > > Thanks,
> > > Qu
> > > >
> > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > ---
> > > >   fs/btrfs/inode.c | 2 +-
> > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > > index 45ebef8d3ea8..fd66586ae2fc 100644
> > > > --- a/fs/btrfs/inode.c
> > > > +++ b/fs/btrfs/inode.c
> > > > @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
> > > >       if (trans)
> > > >               own_trans = false;
> > > >       while (num_bytes > 0) {
> > > > -             cur_bytes = min_t(u64, num_bytes, SZ_256M);
> > > > +             cur_bytes = min_t(u64, num_bytes, SZ_1G);
> > > >               cur_bytes = max(cur_bytes, min_size);
> > > >               /*
> > > >                * If we are severely fragmented we could end up with really
Boris Burkov Oct. 6, 2022, 8:41 p.m. UTC | #6
On Thu, Oct 06, 2022 at 08:56:03PM +0100, Filipe Manana wrote:
> On Thu, Oct 6, 2022 at 7:38 PM Boris Burkov <boris@bur.io> wrote:
> >
> > On Thu, Oct 06, 2022 at 10:48:33AM +0100, Filipe Manana wrote:
> > > On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > > >
> > > >
> > > >
> > > > On 2022/10/6 03:49, Boris Burkov wrote:
> > > > > When doing a large fallocate, btrfs will break it up into 256MiB
> > > > > extents. Our data block groups are 1GiB, so a more natural maximum size
> > > > > is 1GiB, so that we tend to allocate and fully use block groups rather
> > > > > than fragmenting the file around.
> > > > >
> > > > > This is especially useful if large fallocates tend to be for "round"
> > > > > amounts, which strikes me as a reasonable assumption.
> > > > >
> > > > > While moving to size classes reduces the value of this change, it is
> > > > > also good to compare potential allocator algorithms against just 1G
> > > > > extents.
> > > >
> > > > Btrfs extent booking is already causing a lot of wasted space, is this
> > > > larger extent size really a good idea?
> > > >
> > > > E.g. after a lot of random writes, we may have only a very small part of
> > > > the original 1G still being referred.
> > > > (The first write into the pre-allocated range will not be COWed, but the
> > > > next one over the same range will be COWed)
> > > >
> > > > But the full 1G can only be freed if none of its sectors is referred.
> > > > Thus this would make preallocated space much harder to be free,
> > > > snapshots/reflink can make it even worse.
> > > >
> > > > So wouldn't such enlarged preallocted extent size cause more pressure?
> > >
> > > I agree, increasing the max extent size here does not seem like a good
> > > thing to do.
> > >
> > > If an application fallocates space, then it generally expects to write to all
> > > that space. However future random partial writes may not rewrite the entire
> > > extent for a very long time, therefore making us keep a 1G extent for a very
> > > long time (or forever in the worst case).
> > >
> > > Even for NOCOW files, it's still an issue if snapshots are used.
> > >
> >
> > I see your point, and agree 1GiB is worse with respect to bookend
> > extents. Since the patchset doesn't really rely on this, I don't mind
> > dropping the change. I was mostly trying to rule this out as a trivial
> > fix that would obviate the need for other changes.
> >
> > However, I'm not completely convinced by the argument, for two reasons.
> >
> > The first is essentially Qu's last comment. If you guys are right, then
> > 256MiB is probably a really bad value for this as well, and we should be
> > reaping the wins of making it smaller.
> 
> Well, that's not a reason to increase the size, quite the opposite.

Agreed. My point was more: why is it 256MiB? There is something we
trade off against bookend waste (based on your later comment, I take it
to be performance for the big files). I believe that with 256MiB we have
already crossed the bookend Rubicon, so we might as well accept it.

> 
> >
> > The second is that I'm not convinced of how the regression is going to
> > happen here in practice.
> 
> Over the years, every now and then there are users reporting that
> their free space is mysteriously
> going away and they have no clue why, even when not using snapshots.
> More experienced users
> provide help and eventually notice it's caused by many bookend
> extents, and the given workaround
> is to defrag the files.
> 
> You can frequently see such reports in threads on this mailing list or
> in the IRC channel.

That's not what I was asking, really. Who overwrites 256MiB but not
1GiB? Is that common? Overwriting <256MiB (or not covering an extent)
in the status quo is exactly as bad, as far as I can see.

> 
> > Let's say someone does a 2GiB falloc and writes
> > the file out once. In the old code that will be 8 256MiB extents, in the
> > new code, 2 1GiB extents. Then, to have this be a regression, the user
> > would have to fully overwrite one of the 256MiB extents, but not 1GiB.
> > Are there a lot of workloads that don't use nocow, and which randomly
> > overwrite all of a 256MiB extent of a larger file? Maybe..
> 
> Assuming that all or most workloads that use fallocate also set nocow
> on the files, is quite a big stretch IMO.

Agreed. I won't argue based on nocow.

> It's true that for performance critical applications like traditional
> relational databases, people usually set nocow,
> or the application or some other software does it automatically for them.
> 
> But there are also many applications that use fallocate and people are
> not aware of and don't set nocow, nor
> anything else sets nocow automatically on the files used by them.
> Specially if they are not IO intensive, in which
> case they may not be noticed and therefore space wasted due to bookend
> extents is more likely to happen.
> 
> Having a bunch of very small extents, say 4K to 512K, is clearly bad
> compared to having just a few 256M extents.
> But having a few 1G extents instead of a group of 256M extents,
> probably doesn't make such a huge difference as
> in the former case.
> 
> Does the 1G extents benefit some facebook specific workload, or some
> well known workload?

I observed that we have highly fragmented file systems (before
autorelocate) and that after a balance, they regress back to fragmented
in a day or so. The frontier of new block group allocations is all from
large fallocates (mostly from installing binary packages for starting or
updating services, getting configuration/data blobs, etc.).

If you do two 1GiB fallocates in a fs that looks like:
       BG1                BG2            BG3
|xxx.............|................|................|
you get something like:
|xxxxAAAABBBBCCCC|DDDDEEEEFFFFGGGG|HHHH............|
then you do some random little writes:
|xxxxAAAABBBBCCCC|DDDDEEEEFFFFGGGG|HHHHy...........|
then you delete the second fallocate because you're done with that
container:
|xxxxAAAABBBBCCCC|DDDD............|HHHHy...........|
then you do another small write:
|xxxxAAAABBBBCCCC|DDDDz...........|HHHHy...........|
then you fallocate the next package:
|xxxxAAAABBBBCCCC|DDDDzIIIIJJJJ...|HHHHyKKKK.......|
and so on until you blow through 100 block groups in a day.

I don't think this is particularly Facebook specific, but does seem
specific to a "installs/upgrades lots of containers" type of workload.

> 
> 
> >
> > Since I'm making the change, it's incumbent on me to prove it's safe, so
> > with that in mind, I would reiterate I'm fine to drop it.
> >
> > > >
> > > > In fact, the original 256M is already too large to me.
> > > >
> > > > Thanks,
> > > > Qu
> > > > >
> > > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > > ---
> > > > >   fs/btrfs/inode.c | 2 +-
> > > > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > > > > index 45ebef8d3ea8..fd66586ae2fc 100644
> > > > > --- a/fs/btrfs/inode.c
> > > > > +++ b/fs/btrfs/inode.c
> > > > > @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
> > > > >       if (trans)
> > > > >               own_trans = false;
> > > > >       while (num_bytes > 0) {
> > > > > -             cur_bytes = min_t(u64, num_bytes, SZ_256M);
> > > > > +             cur_bytes = min_t(u64, num_bytes, SZ_1G);
> > > > >               cur_bytes = max(cur_bytes, min_size);
> > > > >               /*
> > > > >                * If we are severely fragmented we could end up with really
Qu Wenruo Oct. 6, 2022, 11:03 p.m. UTC | #7
On 2022/10/7 04:41, Boris Burkov wrote:
> On Thu, Oct 06, 2022 at 08:56:03PM +0100, Filipe Manana wrote:
>> On Thu, Oct 6, 2022 at 7:38 PM Boris Burkov <boris@bur.io> wrote:
>>>
>>> On Thu, Oct 06, 2022 at 10:48:33AM +0100, Filipe Manana wrote:
>>>> On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2022/10/6 03:49, Boris Burkov wrote:
>>>>>> When doing a large fallocate, btrfs will break it up into 256MiB
>>>>>> extents. Our data block groups are 1GiB, so a more natural maximum size
>>>>>> is 1GiB, so that we tend to allocate and fully use block groups rather
>>>>>> than fragmenting the file around.
>>>>>>
>>>>>> This is especially useful if large fallocates tend to be for "round"
>>>>>> amounts, which strikes me as a reasonable assumption.
>>>>>>
>>>>>> While moving to size classes reduces the value of this change, it is
>>>>>> also good to compare potential allocator algorithms against just 1G
>>>>>> extents.
>>>>>
>>>>> Btrfs extent booking is already causing a lot of wasted space, is this
>>>>> larger extent size really a good idea?
>>>>>
>>>>> E.g. after a lot of random writes, we may have only a very small part of
>>>>> the original 1G still being referred.
>>>>> (The first write into the pre-allocated range will not be COWed, but the
>>>>> next one over the same range will be COWed)
>>>>>
>>>>> But the full 1G can only be freed if none of its sectors is referred.
>>>>> Thus this would make preallocated space much harder to be free,
>>>>> snapshots/reflink can make it even worse.
>>>>>
>>>>> So wouldn't such enlarged preallocted extent size cause more pressure?
>>>>
>>>> I agree, increasing the max extent size here does not seem like a good
>>>> thing to do.
>>>>
>>>> If an application fallocates space, then it generally expects to write to all
>>>> that space. However future random partial writes may not rewrite the entire
>>>> extent for a very long time, therefore making us keep a 1G extent for a very
>>>> long time (or forever in the worst case).
>>>>
>>>> Even for NOCOW files, it's still an issue if snapshots are used.
>>>>
>>>
>>> I see your point, and agree 1GiB is worse with respect to bookend
>>> extents. Since the patchset doesn't really rely on this, I don't mind
>>> dropping the change. I was mostly trying to rule this out as a trivial
>>> fix that would obviate the need for other changes.
>>>
>>> However, I'm not completely convinced by the argument, for two reasons.
>>>
>>> The first is essentially Qu's last comment. If you guys are right, then
>>> 256MiB is probably a really bad value for this as well, and we should be
>>> reaping the wins of making it smaller.
>>
>> Well, that's not a reason to increase the size, quite the opposite.
> 
> Agreed. My point was more: why is it 256MiB? There is something we would
> tradeoff against bookend waste (based on your later comment, I take it
> to be performance for the big files). I believe that with 256MiB we have
> already crossed the bookend rubicon, so we might as well accept it.

Yep, I believe the initial 256M limit for preallocated file extents (and 
128M for regular extents) is already too large.

But I also understand that too small a limit would cause way more 
backref items, which can also be a problem.

Thus I believe we need to discuss the file extent size limit more deeply.
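
For a sense of the metadata side of that trade-off, a small illustrative
calculation (editorial; it counts one file extent item per extent, while
real numbers also grow with extent backrefs and later splits) for a
fully written 32GiB file, the size Wang Yugui tests later in the thread:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t file_size = 32ULL << 30;		/* 32 GiB */
	const uint64_t caps[] = { 4096, 128ULL << 20, 256ULL << 20, 1ULL << 30 };

	for (unsigned int i = 0; i < sizeof(caps) / sizeof(caps[0]); i++)
		printf("cap %10llu bytes -> at least %llu extent items\n",
		       (unsigned long long)caps[i],
		       (unsigned long long)((file_size + caps[i] - 1) / caps[i]));
	return 0;
}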

> 
>>
>>>
>>> The second is that I'm not convinced of how the regression is going to
>>> happen here in practice.
>>
>> Over the years, every now and then there are users reporting that
>> their free space is mysteriously
>> going away and they have no clue why, even when not using snapshots.
>> More experienced users
>> provide help and eventually notice it's caused by many bookend
>> extents, and the given workaround
>> is to defrag the files.
>>
>> You can frequently see such reports in threads on this mailing list or
>> in the IRC channel.
> 
> That's not what I was asking, really. Who overwrites 256MiB but not
> 1GiB? Is that common? Overwriting <256MiB (or not covering an extent)
> in the status quo is exactly as bad, as far as I can see.

The situation is not only about overwriting the full 256MiB but not the 
full 1G.

It's about chance. If there are random writes into the preallocated 
extent, the larger the extent is, the harder it is to reclaim.
(aka, we're talking about the worst case)

Sure, for random writes it's already hard to reclaim the original 256M 
extent, but it would be way harder to reclaim the 1G one.

> 
>>
>>> Let's say someone does a 2GiB falloc and writes
>>> the file out once. In the old code that will be 8 256MiB extents, in the
>>> new code, 2 1GiB extents. Then, to have this be a regression, the user
>>> would have to fully overwrite one of the 256MiB extents, but not 1GiB.
>>> Are there a lot of workloads that don't use nocow, and which randomly
>>> overwrite all of a 256MiB extent of a larger file? Maybe..
>>
>> Assuming that all or most workloads that use fallocate also set nocow
>> on the files, is quite a big stretch IMO.
> 
> Agreed. I won't argue based on nocow.
> 
>> It's true that for performance critical applications like traditional
>> relational databases, people usually set nocow,
>> or the application or some other software does it automatically for them.
>>
>> But there are also many applications that use fallocate and people are
>> not aware of and don't set nocow, nor
>> anything else sets nocow automatically on the files used by them.
>> Specially if they are not IO intensive, in which
>> case they may not be noticed and therefore space wasted due to bookend
>> extents is more likely to happen.
>>
>> Having a bunch of very small extents, say 4K to 512K, is clearly bad
>> compared to having just a few 256M extents.
>> But having a few 1G extents instead of a group of 256M extents,
>> probably doesn't make such a huge difference as
>> in the former case.
>>
>> Does the 1G extents benefit some facebook specific workload, or some
>> well known workload?
> 
> I observed that we have highly fragmented file systems (before
> autorelocate) and that after a balance, they regress back to fragmented
> in a day or so. The frontier of new block group allocations is all from
> large fallocates (mostly from installing binary packages for starting or
> updating services, getting configuration/data blobs, etc..)

I guess it would be something to improve on the package 
management/container side.

For btrfs, fallocate provides far weaker guarantees than on non-COW 
filesystems. In fact, btrfs fallocate cannot even ensure that a write 
into the preallocated range won't later hit ENOSPC
(just fallocate, make a snapshot, then write into the original subvolume).

Thus would it be more effective to just delete the fallocate call from 
the package management/container code?

Thanks,
Qu

> 
> If you do two 1GiB fallocates in a fs that looks like:
>         BG1                BG2            BG3
> |xxx.............|................|................|
> you get something like:
> |xxxxAAAABBBBCCCC|DDDDEEEEFFFFGGGG|HHHH............|
> then you do some random little writes:
> |xxxxAAAABBBBCCCC|DDDDEEEEFFFFGGGG|HHHHy...........|
> then you delete the second fallocate because you're done with that
> container:
> |xxxxAAAABBBBCCCC|DDDD............|HHHHy...........|
> then you do another small write:
> |xxxxAAAABBBBCCCC|DDDDz...........|HHHHy...........|
> then you fallocate the next package:
> |xxxxAAAABBBBCCCC|DDDDzIIIIJJJJ...|HHHHyKKKK.......|
> and so on until you blow through 100 block groups in a day.
> 
> I don't think this is particularly Facebook specific, but does seem
> specific to a "installs/upgrades lots of containers" type of workload.
> 
>>
>>
>>>
>>> Since I'm making the change, it's incumbent on me to prove it's safe, so
>>> with that in mind, I would reiterate I'm fine to drop it.
>>>
>>>>>
>>>>> In fact, the original 256M is already too large to me.
>>>>>
>>>>> Thanks,
>>>>> Qu
>>>>>>
>>>>>> Signed-off-by: Boris Burkov <boris@bur.io>
>>>>>> ---
>>>>>>    fs/btrfs/inode.c | 2 +-
>>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>>>>> index 45ebef8d3ea8..fd66586ae2fc 100644
>>>>>> --- a/fs/btrfs/inode.c
>>>>>> +++ b/fs/btrfs/inode.c
>>>>>> @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>>>>>>        if (trans)
>>>>>>                own_trans = false;
>>>>>>        while (num_bytes > 0) {
>>>>>> -             cur_bytes = min_t(u64, num_bytes, SZ_256M);
>>>>>> +             cur_bytes = min_t(u64, num_bytes, SZ_1G);
>>>>>>                cur_bytes = max(cur_bytes, min_size);
>>>>>>                /*
>>>>>>                 * If we are severely fragmented we could end up with really
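
For what it's worth, "deleting the fallocate call" on the installer side
is a one-line change in userspace; a minimal sketch (hypothetical
installer code, not taken from any real package manager) of the two
variants:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file> [prealloc]\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_CREAT | O_WRONLY | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * With "prealloc": reserve 1GiB up front. On btrfs this does not
	 * guarantee that later writes succeed (e.g. if a snapshot is taken
	 * in between), which is the point made above. Without it, extents
	 * are simply allocated as the payload is written.
	 */
	if (argc > 2 && posix_fallocate(fd, 0, 1ULL << 30) != 0)
		fprintf(stderr, "preallocation failed, writing anyway\n");

	/* ... stream the package payload into fd with write() as usual ... */

	close(fd);
	return 0;
}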
Wang Yugui Oct. 7, 2022, 3:23 a.m. UTC | #8
Hi,

> When doing a large fallocate, btrfs will break it up into 256MiB
> extents. Our data block groups are 1GiB, so a more natural maximum size
> is 1GiB, so that we tend to allocate and fully use block groups rather
> than fragmenting the file around.
> 
> This is especially useful if large fallocates tend to be for "round"
> amounts, which strikes me as a reasonable assumption.
> 
> While moving to size classes reduces the value of this change, it is
> also good to compare potential allocator algorithms against just 1G
> extents.


I wrote a 32G file and compared the results of 'xfs_io -c fiemap'.

dd conv=fsync bs=1024K count=32K if=/dev/zero of=/mnt/test/dd.txt

When writing to a btrfs filesystem
+ xfs_io -c fiemap /mnt/test/dd.txt
/mnt/test/dd.txt:
        0: [0..262143]: 6883584..7145727
        1: [262144..524287]: 6367232..6629375
        2: [524288..8126463]: 7145728..14747903
        3: [8126464..8388607]: 15272064..15534207
        4: [8388608..8650751]: 14755840..15017983
        5: [8650752..16252927]: 15534208..23136383
        6: [16252928..67108863]: 23144448..74000383

When writing to an xfs filesystem
+ xfs_io -c fiemap /mnt/test/dd.txt
/mnt/test/dd.txt:
        0: [0..16465919]: 256..16466175
        1: [16465920..31821623]: 16466176..31821879
        2: [31821624..41942903]: 31821880..41943159
        3: [41942904..58720111]: 47183872..63961079
        4: [58720112..67108863]: 63961080..72349831

the max extent of xfs is about 8G, but the max extent of btrfs is 
about 25G ('6: [16252928..67108863]')?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/10/07



> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 45ebef8d3ea8..fd66586ae2fc 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>  	if (trans)
>  		own_trans = false;
>  	while (num_bytes > 0) {
> -		cur_bytes = min_t(u64, num_bytes, SZ_256M);
> +		cur_bytes = min_t(u64, num_bytes, SZ_1G);
>  		cur_bytes = max(cur_bytes, min_size);
>  		/*
>  		 * If we are severely fragmented we could end up with really
> -- 
> 2.37.2
Qu Wenruo Oct. 7, 2022, 3:29 a.m. UTC | #9
On 2022/10/7 11:23, Wang Yugui wrote:
> Hi,
>
>> When doing a large fallocate, btrfs will break it up into 256MiB
>> extents. Our data block groups are 1GiB, so a more natural maximum size
>> is 1GiB, so that we tend to allocate and fully use block groups rather
>> than fragmenting the file around.
>>
>> This is especially useful if large fallocates tend to be for "round"
>> amounts, which strikes me as a reasonable assumption.
>>
>> While moving to size classes reduces the value of this change, it is
>> also good to compare potential allocator algorithms against just 1G
>> extents.
>
>
> I wrote a 32G file, and the compare the result of 'xfs_io -c fiemap'.
>
> dd conv=fsync bs=1024K count=32K if=/dev/zero of=/mnt/test/dd.txt
>
> When write to a btrfs filesystem
> + xfs_io -c fiemap /mnt/test/dd.txt
> /mnt/test/dd.txt:
>          0: [0..262143]: 6883584..7145727
>          1: [262144..524287]: 6367232..6629375
>          2: [524288..8126463]: 7145728..14747903
>          3: [8126464..8388607]: 15272064..15534207
>          4: [8388608..8650751]: 14755840..15017983
>          5: [8650752..16252927]: 15534208..23136383
>          6: [16252928..67108863]: 23144448..74000383
>
> When write to a xfs filesystem
> + xfs_io -c fiemap /mnt/test/dd.txt
> /mnt/test/dd.txt:
>          0: [0..16465919]: 256..16466175
>          1: [16465920..31821623]: 16466176..31821879
>          2: [31821624..41942903]: 31821880..41943159
>          3: [41942904..58720111]: 47183872..63961079
>          4: [58720112..67108863]: 63961080..72349831

Don't waste your time on fiemap output. Btrfs can merge fiemap results at
write time.

# dd if=/dev/zero  bs=4k count=1024 oflag=sync of=/mnt/btrfs/file

The above command should result in all file extents being 4K, which can
easily be verified through dump-tree.

	item 197 key (258 EXTENT_DATA 4186112) itemoff 5789 itemsize 53
		generation 8 type 1 (regular)
		extent data disk byte 22011904 nr 4096
		extent data offset 0 nr 4096 ram 4096

But fiemap will only report one single extent:

# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/btrfs/file:
  EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
    0: [0..8191]:       34816..43007      8192   0x1

>
> the max of xfs is about 8G, but the max of btrfs is
> about 25G('6: [16252928..67108863]'?
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/10/07
>
>
>
>> Signed-off-by: Boris Burkov <boris@bur.io>
>> ---
>>   fs/btrfs/inode.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 45ebef8d3ea8..fd66586ae2fc 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>>   	if (trans)
>>   		own_trans = false;
>>   	while (num_bytes > 0) {
>> -		cur_bytes = min_t(u64, num_bytes, SZ_256M);
>> +		cur_bytes = min_t(u64, num_bytes, SZ_1G);
>>   		cur_bytes = max(cur_bytes, min_size);
>>   		/*
>>   		 * If we are severely fragmented we could end up with really
>> --
>> 2.37.2
>
>
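
A toy illustration of the merging described above (editorial, not the
kernel's fiemap code): records that are contiguous both logically and
physically collapse into one reported extent, which is why 1024 separate
4K file extent items can show up as a single fiemap entry.

#include <stdint.h>
#include <stdio.h>

struct ext {
	uint64_t logical;
	uint64_t physical;
	uint64_t len;
};

/* Merge records that are contiguous both logically and physically. */
static unsigned int merge(struct ext *e, unsigned int n)
{
	unsigned int out = 0;

	if (n == 0)
		return 0;
	for (unsigned int i = 1; i < n; i++) {
		if (e[out].logical + e[out].len == e[i].logical &&
		    e[out].physical + e[out].len == e[i].physical)
			e[out].len += e[i].len;
		else
			e[++out] = e[i];
	}
	return out + 1;
}

int main(void)
{
	/* 1024 separate 4K on-disk extents laid out back to back. */
	struct ext extents[1024];

	for (unsigned int i = 0; i < 1024; i++) {
		extents[i].logical = (uint64_t)i * 4096;
		extents[i].physical = 22011904ULL + (uint64_t)i * 4096;
		extents[i].len = 4096;
	}
	printf("file extent items: 1024, fiemap records: %u\n",
	       merge(extents, 1024));
	return 0;
}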
Qu Wenruo Oct. 7, 2022, 3:40 a.m. UTC | #10
On 2022/10/7 11:29, Qu Wenruo wrote:
>
>
> On 2022/10/7 11:23, Wang Yugui wrote:
>> Hi,
>>
>>> When doing a large fallocate, btrfs will break it up into 256MiB
>>> extents. Our data block groups are 1GiB, so a more natural maximum size
>>> is 1GiB, so that we tend to allocate and fully use block groups rather
>>> than fragmenting the file around.
>>>
>>> This is especially useful if large fallocates tend to be for "round"
>>> amounts, which strikes me as a reasonable assumption.
>>>
>>> While moving to size classes reduces the value of this change, it is
>>> also good to compare potential allocator algorithms against just 1G
>>> extents.
>>
>>
>> I wrote a 32G file, and the compare the result of 'xfs_io -c fiemap'.
>>
>> dd conv=fsync bs=1024K count=32K if=/dev/zero of=/mnt/test/dd.txt
>>
>> When write to a btrfs filesystem
>> + xfs_io -c fiemap /mnt/test/dd.txt
>> /mnt/test/dd.txt:
>>          0: [0..262143]: 6883584..7145727
>>          1: [262144..524287]: 6367232..6629375
>>          2: [524288..8126463]: 7145728..14747903
>>          3: [8126464..8388607]: 15272064..15534207
>>          4: [8388608..8650751]: 14755840..15017983
>>          5: [8650752..16252927]: 15534208..23136383
>>          6: [16252928..67108863]: 23144448..74000383
>>
>> When write to a xfs filesystem
>> + xfs_io -c fiemap /mnt/test/dd.txt
>> /mnt/test/dd.txt:
>>          0: [0..16465919]: 256..16466175
>>          1: [16465920..31821623]: 16466176..31821879
>>          2: [31821624..41942903]: 31821880..41943159
>>          3: [41942904..58720111]: 47183872..63961079
>>          4: [58720112..67108863]: 63961080..72349831
>
> Don't waste your time on fiemap output. Btrfs can do merge fie result at
> write time.

s/write/fiemap emit/

>
> # dd if=/dev/zero  bs=4k count=1024 oflag=sync of=/mnt/btrfs/file
>
> Above command should result all file extents to be 4K, and can be easily
> verified through dump tree.
>
>      item 197 key (258 EXTENT_DATA 4186112) itemoff 5789 itemsize 53
>          generation 8 type 1 (regular)
>          extent data disk byte 22011904 nr 4096
>          extent data offset 0 nr 4096 ram 4096
>
> But fiemap will only report one single extent:
>
> # xfs_io -c "fiemap -v" /mnt/btrfs/file
> /mnt/btrfs/file:
>   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
>     0: [0..8191]:       34816..43007      8192   0x1
>
>>
>> the max of xfs is about 8G, but the max of btrfs is
>> about 25G('6: [16252928..67108863]'?
>>
>> Best Regards
>> Wang Yugui (wangyugui@e16-tech.com)
>> 2022/10/07
>>
>>
>>
>>> Signed-off-by: Boris Burkov <boris@bur.io>
>>> ---
>>>   fs/btrfs/inode.c | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 45ebef8d3ea8..fd66586ae2fc 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct
>>> inode *inode, int mode,
>>>       if (trans)
>>>           own_trans = false;
>>>       while (num_bytes > 0) {
>>> -        cur_bytes = min_t(u64, num_bytes, SZ_256M);
>>> +        cur_bytes = min_t(u64, num_bytes, SZ_1G);
>>>           cur_bytes = max(cur_bytes, min_size);
>>>           /*
>>>            * If we are severely fragmented we could end up with really
>>> --
>>> 2.37.2
>>
>>

Patch

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 45ebef8d3ea8..fd66586ae2fc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9884,7 +9884,7 @@  static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 	if (trans)
 		own_trans = false;
 	while (num_bytes > 0) {
-		cur_bytes = min_t(u64, num_bytes, SZ_256M);
+		cur_bytes = min_t(u64, num_bytes, SZ_1G);
 		cur_bytes = max(cur_bytes, min_size);
 		/*
 		 * If we are severely fragmented we could end up with really