Message ID: cace4a8be466b9c4fee288c768c5384988c1fca8.1664999303.git.boris@bur.io (mailing list archive)
State: New, archived
Series: btrfs: data block group size classes
On 2022/10/6 03:49, Boris Burkov wrote:
> When doing a large fallocate, btrfs will break it up into 256MiB
> extents. Our data block groups are 1GiB, so a more natural maximum size
> is 1GiB, so that we tend to allocate and fully use block groups rather
> than fragmenting the file around.
>
> This is especially useful if large fallocates tend to be for "round"
> amounts, which strikes me as a reasonable assumption.
>
> While moving to size classes reduces the value of this change, it is
> also good to compare potential allocator algorithms against just 1G
> extents.

Btrfs extent booking is already causing a lot of wasted space, is this
larger extent size really a good idea?

E.g. after a lot of random writes, we may have only a very small part of
the original 1G still being referred.
(The first write into the pre-allocated range will not be COWed, but the
next one over the same range will be COWed.)

But the full 1G can only be freed if none of its sectors is referred.
Thus this would make preallocated space much harder to free;
snapshots/reflink can make it even worse.

So wouldn't such an enlarged preallocated extent size cause more pressure?

In fact, the original 256M is already too large for me.

Thanks,
Qu

> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 45ebef8d3ea8..fd66586ae2fc 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>  	if (trans)
>  		own_trans = false;
>  	while (num_bytes > 0) {
> -		cur_bytes = min_t(u64, num_bytes, SZ_256M);
> +		cur_bytes = min_t(u64, num_bytes, SZ_1G);
>  		cur_bytes = max(cur_bytes, min_size);
>  		/*
>  		 * If we are severely fragmented we could end up with really
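For readers following along, the loop being patched can be sketched as a toy model. This is illustrative Python, not the kernel C: the real loop in __btrfs_prealloc_file_range() uses min_t()/max() on u64 values and also handles alignment, reservations, and errors, all omitted here.

```python
SZ_256M = 256 * 1024 ** 2
SZ_1G = 1024 ** 3

def prealloc_chunks(num_bytes, min_size, max_extent):
    """Toy model of the fallocate loop: each pass carves one extent,
    clamped between min_size and max_extent."""
    chunks = []
    while num_bytes > 0:
        cur_bytes = min(num_bytes, max_extent)
        cur_bytes = max(cur_bytes, min_size)
        chunks.append(cur_bytes)
        num_bytes -= cur_bytes
    return chunks

# The case discussed later in the thread: a 2GiB fallocate.
assert len(prealloc_chunks(2 * SZ_1G, 4096, SZ_256M)) == 8  # before the patch
assert len(prealloc_chunks(2 * SZ_1G, 4096, SZ_1G)) == 2    # after the patch
```

The patch only changes the clamp's upper bound, so a large fallocate is carved into fewer, larger extents.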
On 05.10.22 21:49, Boris Burkov wrote:
> When doing a large fallocate, btrfs will break it up into 256MiB
> extents. Our data block groups are 1GiB, so a more natural maximum size
> is 1GiB, so that we tend to allocate and fully use block groups rather
> than fragmenting the file around.
>
> This is especially useful if large fallocates tend to be for "round"
> amounts, which strikes me as a reasonable assumption.
>
> While moving to size classes reduces the value of this change, it is
> also good to compare potential allocator algorithms against just 1G
> extents.
>
> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 45ebef8d3ea8..fd66586ae2fc 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>  	if (trans)
>  		own_trans = false;
>  	while (num_bytes > 0) {
> -		cur_bytes = min_t(u64, num_bytes, SZ_256M);
> +		cur_bytes = min_t(u64, num_bytes, SZ_1G);
>  		cur_bytes = max(cur_bytes, min_size);
>  		/*
>  		 * If we are severely fragmented we could end up with really

Shouldn't this be space_info::chunk_size to be future proof?
On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
> On 2022/10/6 03:49, Boris Burkov wrote:
> > When doing a large fallocate, btrfs will break it up into 256MiB
> > extents. Our data block groups are 1GiB, so a more natural maximum size
> > is 1GiB, so that we tend to allocate and fully use block groups rather
> > than fragmenting the file around.
> >
> > This is especially useful if large fallocates tend to be for "round"
> > amounts, which strikes me as a reasonable assumption.
> >
> > While moving to size classes reduces the value of this change, it is
> > also good to compare potential allocator algorithms against just 1G
> > extents.
>
> Btrfs extent booking is already causing a lot of wasted space, is this
> larger extent size really a good idea?
>
> E.g. after a lot of random writes, we may have only a very small part of
> the original 1G still being referred.
> (The first write into the pre-allocated range will not be COWed, but the
> next one over the same range will be COWed.)
>
> But the full 1G can only be freed if none of its sectors is referred.
> Thus this would make preallocated space much harder to free;
> snapshots/reflink can make it even worse.
>
> So wouldn't such an enlarged preallocated extent size cause more pressure?

I agree, increasing the max extent size here does not seem like a good
thing to do.

If an application fallocates space, then it generally expects to write to
all that space. However, future random partial writes may not rewrite the
entire extent for a very long time, therefore making us keep a 1G extent
for a very long time (or forever in the worst case).

Even for NOCOW files, it's still an issue if snapshots are used.

> In fact, the original 256M is already too large for me.
>
> Thanks,
> Qu
>
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > [...]
On Thu, Oct 06, 2022 at 10:48:33AM +0100, Filipe Manana wrote:
> On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> > On 2022/10/6 03:49, Boris Burkov wrote:
> > > When doing a large fallocate, btrfs will break it up into 256MiB
> > > extents. [...]
> >
> > Btrfs extent booking is already causing a lot of wasted space, is this
> > larger extent size really a good idea?
> >
> > E.g. after a lot of random writes, we may have only a very small part of
> > the original 1G still being referred.
> > (The first write into the pre-allocated range will not be COWed, but the
> > next one over the same range will be COWed.)
> >
> > But the full 1G can only be freed if none of its sectors is referred.
> > Thus this would make preallocated space much harder to free;
> > snapshots/reflink can make it even worse.
> >
> > So wouldn't such an enlarged preallocated extent size cause more pressure?
>
> I agree, increasing the max extent size here does not seem like a good
> thing to do.
>
> If an application fallocates space, then it generally expects to write to
> all that space. However, future random partial writes may not rewrite the
> entire extent for a very long time, therefore making us keep a 1G extent
> for a very long time (or forever in the worst case).
>
> Even for NOCOW files, it's still an issue if snapshots are used.

I see your point, and agree 1GiB is worse with respect to bookend
extents. Since the patchset doesn't really rely on this, I don't mind
dropping the change. I was mostly trying to rule this out as a trivial
fix that would obviate the need for other changes.

However, I'm not completely convinced by the argument, for two reasons.

The first is essentially Qu's last comment. If you guys are right, then
256MiB is probably a really bad value for this as well, and we should be
reaping the wins of making it smaller.

The second is that I'm not convinced of how the regression is going to
happen here in practice. Let's say someone does a 2GiB falloc and writes
the file out once. In the old code that will be 8 256MiB extents; in the
new code, 2 1GiB extents. Then, for this to be a regression, the user
would have to fully overwrite one of the 256MiB extents, but not 1GiB.
Are there a lot of workloads that don't use nocow, and which randomly
overwrite all of a 256MiB extent of a larger file? Maybe...

Since I'm making the change, it's incumbent on me to prove it's safe, so
with that in mind, I would reiterate I'm fine to drop it.

> > In fact, the original 256M is already too large for me.
> >
> > Thanks,
> > Qu
> >
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > [...]
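The reclaim concern debated above can be made concrete with a toy model. This is illustrative Python under the simplified "whole extent or nothing" reclaim rule Qu describes (an extent is freed only when no byte of it is still referenced); real btrfs tracks references via extent items and backrefs, which this ignores.

```python
MiB = 1024 ** 2

def pinned_bytes(file_size, extent_size, surviving_ranges):
    """Bytes still held on disk after rewrites: any extent overlapping
    a surviving (not yet rewritten) byte range stays fully allocated,
    i.e. becomes a bookend extent."""
    pinned = 0
    for start in range(0, file_size, extent_size):
        end = start + extent_size
        if any(s < end and e > start for s, e in surviving_ranges):
            pinned += extent_size
    return pinned

# A 2GiB fallocate, later rewritten except one 4KiB range at offset 300MiB:
survivors = [(300 * MiB, 300 * MiB + 4096)]
assert pinned_bytes(2048 * MiB, 256 * MiB, survivors) == 256 * MiB    # old max
assert pinned_bytes(2048 * MiB, 1024 * MiB, survivors) == 1024 * MiB  # new max
```

Under this rule the worst-case waste from a single surviving sliver scales with the maximum extent size, which is Qu's and Filipe's objection; Boris's counterpoint is that a workload must rewrite everything but such a sliver for the 256MiB-vs-1GiB difference to show.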
On Thu, Oct 6, 2022 at 7:38 PM Boris Burkov <boris@bur.io> wrote:
>
> On Thu, Oct 06, 2022 at 10:48:33AM +0100, Filipe Manana wrote:
> > On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > >
> > > On 2022/10/6 03:49, Boris Burkov wrote:
> > > > When doing a large fallocate, btrfs will break it up into 256MiB
> > > > extents. [...]
> > >
> > > Btrfs extent booking is already causing a lot of wasted space, is this
> > > larger extent size really a good idea?
> > > [...]
> > >
> > > So wouldn't such an enlarged preallocated extent size cause more pressure?
> >
> > I agree, increasing the max extent size here does not seem like a good
> > thing to do.
> >
> > If an application fallocates space, then it generally expects to write to
> > all that space. However, future random partial writes may not rewrite the
> > entire extent for a very long time, therefore making us keep a 1G extent
> > for a very long time (or forever in the worst case).
> >
> > Even for NOCOW files, it's still an issue if snapshots are used.
>
> I see your point, and agree 1GiB is worse with respect to bookend
> extents. Since the patchset doesn't really rely on this, I don't mind
> dropping the change. I was mostly trying to rule this out as a trivial
> fix that would obviate the need for other changes.
>
> However, I'm not completely convinced by the argument, for two reasons.
>
> The first is essentially Qu's last comment. If you guys are right, then
> 256MiB is probably a really bad value for this as well, and we should be
> reaping the wins of making it smaller.

Well, that's not a reason to increase the size, quite the opposite.

> The second is that I'm not convinced of how the regression is going to
> happen here in practice.

Over the years, every now and then there are users reporting that their
free space is mysteriously going away and they have no clue why, even
when not using snapshots. More experienced users provide help and
eventually notice it's caused by many bookend extents, and the given
workaround is to defrag the files.

You can frequently see such reports in threads on this mailing list or
in the IRC channel.

> Let's say someone does a 2GiB falloc and writes
> the file out once. In the old code that will be 8 256MiB extents; in the
> new code, 2 1GiB extents. Then, for this to be a regression, the user
> would have to fully overwrite one of the 256MiB extents, but not 1GiB.
> Are there a lot of workloads that don't use nocow, and which randomly
> overwrite all of a 256MiB extent of a larger file? Maybe...

Assuming that all or most workloads that use fallocate also set nocow on
the files is quite a big stretch IMO.

It's true that for performance critical applications like traditional
relational databases, people usually set nocow, or the application or
some other software does it automatically for them.

But there are also many applications that use fallocate that people are
not aware of and that don't get nocow set, nor does anything else set
nocow automatically on the files used by them. Especially if they are
not IO intensive, in which case they may not be noticed, and therefore
space wasted due to bookend extents is more likely to happen.

Having a bunch of very small extents, say 4K to 512K, is clearly bad
compared to having just a few 256M extents. But having a few 1G extents
instead of a group of 256M extents probably doesn't make such a huge
difference as in the former case.

Do the 1G extents benefit some Facebook specific workload, or some well
known workload?

> Since I'm making the change, it's incumbent on me to prove it's safe, so
> with that in mind, I would reiterate I'm fine to drop it.
>
> > > In fact, the original 256M is already too large for me.
> > >
> > > Thanks,
> > > Qu
> > >
> > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > [...]
On Thu, Oct 06, 2022 at 08:56:03PM +0100, Filipe Manana wrote:
> On Thu, Oct 6, 2022 at 7:38 PM Boris Burkov <boris@bur.io> wrote:
> >
> > On Thu, Oct 06, 2022 at 10:48:33AM +0100, Filipe Manana wrote:
> > > On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > > >
> > > > On 2022/10/6 03:49, Boris Burkov wrote:
> > > > > When doing a large fallocate, btrfs will break it up into 256MiB
> > > > > extents. [...]
> > > >
> > > > So wouldn't such an enlarged preallocated extent size cause more pressure?
> > >
> > > I agree, increasing the max extent size here does not seem like a good
> > > thing to do.
> > > [...]
> > >
> > > Even for NOCOW files, it's still an issue if snapshots are used.
> >
> > I see your point, and agree 1GiB is worse with respect to bookend
> > extents. Since the patchset doesn't really rely on this, I don't mind
> > dropping the change. I was mostly trying to rule this out as a trivial
> > fix that would obviate the need for other changes.
> >
> > However, I'm not completely convinced by the argument, for two reasons.
> >
> > The first is essentially Qu's last comment. If you guys are right, then
> > 256MiB is probably a really bad value for this as well, and we should be
> > reaping the wins of making it smaller.
>
> Well, that's not a reason to increase the size, quite the opposite.

Agreed. My point was more: why is it 256MiB? There is something we would
trade off against bookend waste (based on your later comment, I take it
to be performance for the big files). I believe that with 256MiB we have
already crossed the bookend rubicon, so we might as well accept it.

> > The second is that I'm not convinced of how the regression is going to
> > happen here in practice.
>
> Over the years, every now and then there are users reporting that their
> free space is mysteriously going away and they have no clue why, even
> when not using snapshots. More experienced users provide help and
> eventually notice it's caused by many bookend extents, and the given
> workaround is to defrag the files.
>
> You can frequently see such reports in threads on this mailing list or
> in the IRC channel.

That's not what I was asking, really. Who overwrites 256MiB but not
1GiB? Is that common? Overwriting <256MiB (or not covering an extent)
in the status quo is exactly as bad, as far as I can see.

> > Let's say someone does a 2GiB falloc and writes
> > the file out once. [...]
> > Are there a lot of workloads that don't use nocow, and which randomly
> > overwrite all of a 256MiB extent of a larger file? Maybe...
>
> Assuming that all or most workloads that use fallocate also set nocow on
> the files is quite a big stretch IMO.

Agreed. I won't argue based on nocow.

> It's true that for performance critical applications like traditional
> relational databases, people usually set nocow, or the application or
> some other software does it automatically for them.
>
> But there are also many applications that use fallocate that people are
> not aware of and that don't get nocow set, nor does anything else set
> nocow automatically on the files used by them. Especially if they are
> not IO intensive, in which case they may not be noticed, and therefore
> space wasted due to bookend extents is more likely to happen.
>
> Having a bunch of very small extents, say 4K to 512K, is clearly bad
> compared to having just a few 256M extents. But having a few 1G extents
> instead of a group of 256M extents probably doesn't make such a huge
> difference as in the former case.
>
> Do the 1G extents benefit some Facebook specific workload, or some well
> known workload?

I observed that we have highly fragmented file systems (before
autorelocate) and that after a balance, they regress back to fragmented
in a day or so. The frontier of new block group allocations is all from
large fallocates (mostly from installing binary packages for starting or
updating services, getting configuration/data blobs, etc.).

If you do two 1GiB fallocates in a fs that looks like:
 BG1              BG2              BG3
|xxxx............|................|................|
you get something like:
|xxxxAAAABBBBCCCC|DDDDEEEEFFFFGGGG|HHHH............|
then you do some random little writes:
|xxxxAAAABBBBCCCC|DDDDEEEEFFFFGGGG|HHHHy...........|
then you delete the second fallocate because you're done with that
container:
|xxxxAAAABBBBCCCC|DDDD............|HHHHy...........|
then you do another small write:
|xxxxAAAABBBBCCCC|DDDDz...........|HHHHy...........|
then you fallocate the next package:
|xxxxAAAABBBBCCCC|DDDDzIIIIJJJJ...|HHHHyKKKK.......|
and so on until you blow through 100 block groups in a day.

I don't think this is particularly Facebook specific, but does seem
specific to an "installs/upgrades lots of containers" type of workload.

> > Since I'm making the change, it's incumbent on me to prove it's safe, so
> > with that in mind, I would reiterate I'm fine to drop it.
> >
> > > > In fact, the original 256M is already too large for me.
> > > >
> > > > Thanks,
> > > > Qu
> > > >
> > > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > > [...]
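The hole-poisoning effect in the walk-through above can be sketched with a toy first-fit allocator. This is illustrative Python: one cell stands for 64MiB, sixteen cells for a 1GiB block group, and the real btrfs allocator (clustering, size classes, per-block-group free space trees) is far more involved.

```python
def first_fit_alloc(disk, n, ch):
    """Mark the first contiguous run of n free cells ('.') with ch.
    Returns the start index, or None if no run is large enough."""
    run = 0
    for i, c in enumerate(disk):
        run = run + 1 if c == '.' else 0
        if run == n:
            start = i - n + 1
            disk[start:i + 1] = ch * n
            return start
    return None

def free_extent(disk, ch):
    """Release every cell owned by ch (e.g. delete a container's data)."""
    for i, c in enumerate(disk):
        if c == ch:
            disk[i] = '.'

disk = list('.' * 16)                          # one empty 1GiB block group
assert first_fit_alloc(disk, 12, 'A') == 0     # container payload
assert first_fit_alloc(disk, 4, 'B') == 12     # something else fills the rest
free_extent(disk, 'A')                         # container deleted: 12-cell hole
assert first_fit_alloc(disk, 1, 'z') == 0      # a small write lands in the hole
assert first_fit_alloc(disk, 12, 'C') is None  # the next big extent no longer
                                               # fits; it spills to a new BG
```

Each small write that lands in a freed hole shrinks the largest contiguous run, so subsequent large fallocates keep spilling into fresh block groups, which is the churn being described.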
On 2022/10/7 04:41, Boris Burkov wrote:
> On Thu, Oct 06, 2022 at 08:56:03PM +0100, Filipe Manana wrote:
>> On Thu, Oct 6, 2022 at 7:38 PM Boris Burkov <boris@bur.io> wrote:
>>>
>>> On Thu, Oct 06, 2022 at 10:48:33AM +0100, Filipe Manana wrote:
>>>> On Thu, Oct 6, 2022 at 9:06 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>>>
>>>>> On 2022/10/6 03:49, Boris Burkov wrote:
>>>>>> When doing a large fallocate, btrfs will break it up into 256MiB
>>>>>> extents. [...]
>>>>>
>>>>> So wouldn't such an enlarged preallocated extent size cause more pressure?
>>>>
>>>> I agree, increasing the max extent size here does not seem like a good
>>>> thing to do.
>>>> [...]
>>>>
>>>> Even for NOCOW files, it's still an issue if snapshots are used.
>>>
>>> I see your point, and agree 1GiB is worse with respect to bookend
>>> extents. Since the patchset doesn't really rely on this, I don't mind
>>> dropping the change. I was mostly trying to rule this out as a trivial
>>> fix that would obviate the need for other changes.
>>>
>>> However, I'm not completely convinced by the argument, for two reasons.
>>>
>>> The first is essentially Qu's last comment. If you guys are right, then
>>> 256MiB is probably a really bad value for this as well, and we should be
>>> reaping the wins of making it smaller.
>>
>> Well, that's not a reason to increase the size, quite the opposite.
>
> Agreed. My point was more: why is it 256MiB? There is something we would
> trade off against bookend waste (based on your later comment, I take it
> to be performance for the big files). I believe that with 256MiB we have
> already crossed the bookend rubicon, so we might as well accept it.

Yep, I believe the initial 256M limit for preallocated file extents (and
128M for regular extents) is already too large.

But I also understand that too small a limit would cause way more
backref items, which can also be a problem.

Thus I believe we need to discuss the file extent size limit more deeply.

>>> The second is that I'm not convinced of how the regression is going to
>>> happen here in practice.
>>
>> Over the years, every now and then there are users reporting that their
>> free space is mysteriously going away and they have no clue why, even
>> when not using snapshots. More experienced users provide help and
>> eventually notice it's caused by many bookend extents, and the given
>> workaround is to defrag the files.
>>
>> You can frequently see such reports in threads on this mailing list or
>> in the IRC channel.
>
> That's not what I was asking, really. Who overwrites 256MiB but not
> 1GiB? Is that common? Overwriting <256MiB (or not covering an extent)
> in the status quo is exactly as bad, as far as I can see.

The situation is not only about overwriting the full 256MiB but not the
full 1G. It's about chance.

If there are random writes into the preallocated extent, the larger the
extent is, the harder it is to reclaim. (aka, we're talking about the
worst case)

Sure, for random writes it's already hard to reclaim the original 256M
extent, but it would be way harder to reclaim the 1G one.

>>> Let's say someone does a 2GiB falloc and writes
>>> the file out once. [...]
>>
>> Assuming that all or most workloads that use fallocate also set nocow on
>> the files is quite a big stretch IMO.
>
> Agreed. I won't argue based on nocow.
>
>> [...]
>>
>> Do the 1G extents benefit some Facebook specific workload, or some well
>> known workload?
>
> I observed that we have highly fragmented file systems (before
> autorelocate) and that after a balance, they regress back to fragmented
> in a day or so. The frontier of new block group allocations is all from
> large fallocates (mostly from installing binary packages for starting or
> updating services, getting configuration/data blobs, etc.).

I guess it would be something to improve from the package
management/container side.

For btrfs, fallocate provides way less guarantee than non-COW
filesystems. In fact, btrfs fallocate cannot even ensure that the
following write after fallocate won't cause ENOSPC. (Just fallocate,
make a snapshot, then write into the original subv.)

Thus would it be more effective to just delete the fallocate call from
the package management/container code?

Thanks,
Qu

> If you do two 1GiB fallocates in a fs that looks like:
> [...]
> and so on until you blow through 100 block groups in a day.
>
> I don't think this is particularly Facebook specific, but does seem
> specific to an "installs/upgrades lots of containers" type of workload.
> [...]
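Qu's ENOSPC point reduces to simple accounting: a snapshot keeps the preallocated extent referenced, so a write into the original subvolume must be COWed into new space while the old extent stays pinned. A minimal sketch (illustrative Python; real btrfs space reservation and metadata overhead are ignored):

```python
SZ_1G = 1024 ** 3

def bytes_needed(prealloc, overwritten, snapshot):
    """Data space consumed after overwriting part of a preallocated range.
    Without a snapshot the first write lands in place (NOCOW); with one,
    the old extent stays referenced and the overwrite is COWed."""
    return prealloc + (overwritten if snapshot else 0)

free_space = SZ_1G
# fallocate all free space, then overwrite it all:
assert bytes_needed(SZ_1G, SZ_1G, snapshot=False) <= free_space
assert bytes_needed(SZ_1G, SZ_1G, snapshot=True) > free_space  # ENOSPC
```

So on btrfs the preallocation guarantee only holds as long as no snapshot (or reflink) shares the extent, which is why the guarantee is weaker than on non-COW filesystems.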
Hi, > When doing a large fallocate, btrfs will break it up into 256MiB > extents. Our data block groups are 1GiB, so a more natural maximum size > is 1GiB, so that we tend to allocate and fully use block groups rather > than fragmenting the file around. > > This is especially useful if large fallocates tend to be for "round" > amounts, which strikes me as a reasonable assumption. > > While moving to size classes reduces the value of this change, it is > also good to compare potential allocator algorithms against just 1G > extents. I wrote a 32G file, and the compare the result of 'xfs_io -c fiemap'. dd conv=fsync bs=1024K count=32K if=/dev/zero of=/mnt/test/dd.txt When write to a btrfs filesystem + xfs_io -c fiemap /mnt/test/dd.txt /mnt/test/dd.txt: 0: [0..262143]: 6883584..7145727 1: [262144..524287]: 6367232..6629375 2: [524288..8126463]: 7145728..14747903 3: [8126464..8388607]: 15272064..15534207 4: [8388608..8650751]: 14755840..15017983 5: [8650752..16252927]: 15534208..23136383 6: [16252928..67108863]: 23144448..74000383 When write to a xfs filesystem + xfs_io -c fiemap /mnt/test/dd.txt /mnt/test/dd.txt: 0: [0..16465919]: 256..16466175 1: [16465920..31821623]: 16466176..31821879 2: [31821624..41942903]: 31821880..41943159 3: [41942904..58720111]: 47183872..63961079 4: [58720112..67108863]: 63961080..72349831 the max of xfs is about 8G, but the max of btrfs is about 25G('6: [16252928..67108863]'? 
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/10/07


> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  fs/btrfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 45ebef8d3ea8..fd66586ae2fc 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
> 	if (trans)
> 		own_trans = false;
> 	while (num_bytes > 0) {
> -		cur_bytes = min_t(u64, num_bytes, SZ_256M);
> +		cur_bytes = min_t(u64, num_bytes, SZ_1G);
> 		cur_bytes = max(cur_bytes, min_size);
> 		/*
> 		 * If we are severely fragmented we could end up with really
> --
> 2.37.2
On 2022/10/7 11:23, Wang Yugui wrote:
> Hi,
>
>> When doing a large fallocate, btrfs will break it up into 256MiB
>> extents. Our data block groups are 1GiB, so a more natural maximum size
>> is 1GiB, so that we tend to allocate and fully use block groups rather
>> than fragmenting the file around.
>>
>> This is especially useful if large fallocates tend to be for "round"
>> amounts, which strikes me as a reasonable assumption.
>>
>> While moving to size classes reduces the value of this change, it is
>> also good to compare potential allocator algorithms against just 1G
>> extents.
>
> I wrote a 32G file and compared the results of 'xfs_io -c fiemap'.
>
> dd conv=fsync bs=1024K count=32K if=/dev/zero of=/mnt/test/dd.txt
>
> When writing to a btrfs filesystem:
> + xfs_io -c fiemap /mnt/test/dd.txt
> /mnt/test/dd.txt:
> 	0: [0..262143]: 6883584..7145727
> 	1: [262144..524287]: 6367232..6629375
> 	2: [524288..8126463]: 7145728..14747903
> 	3: [8126464..8388607]: 15272064..15534207
> 	4: [8388608..8650751]: 14755840..15017983
> 	5: [8650752..16252927]: 15534208..23136383
> 	6: [16252928..67108863]: 23144448..74000383
>
> When writing to an xfs filesystem:
> + xfs_io -c fiemap /mnt/test/dd.txt
> /mnt/test/dd.txt:
> 	0: [0..16465919]: 256..16466175
> 	1: [16465920..31821623]: 16466176..31821879
> 	2: [31821624..41942903]: 31821880..41943159
> 	3: [41942904..58720111]: 47183872..63961079
> 	4: [58720112..67108863]: 63961080..72349831

Don't waste your time on fiemap output. Btrfs can merge the fiemap
result at write time.

# dd if=/dev/zero bs=4k count=1024 oflag=sync of=/mnt/btrfs/file

The above command should result in all file extents being 4K, which can
be easily verified through dump tree.
  item 197 key (258 EXTENT_DATA 4186112) itemoff 5789 itemsize 53
	generation 8 type 1 (regular)
	extent data disk byte 22011904 nr 4096
	extent data offset 0 nr 4096 ram 4096

But fiemap will only report one single extent:

# xfs_io -c "fiemap -v" /mnt/btrfs/file
/mnt/btrfs/file:
 EXT: FILE-OFFSET      BLOCK-RANGE  TOTAL FLAGS
   0: [0..8191]:       34816..43007  8192   0x1

>
> The largest extent on xfs is about 8G, but the largest on btrfs is
> about 25G ('6: [16252928..67108863]')?
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/10/07
>
>
>
>> Signed-off-by: Boris Burkov <boris@bur.io>
>> ---
>>  fs/btrfs/inode.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index 45ebef8d3ea8..fd66586ae2fc 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>> 	if (trans)
>> 		own_trans = false;
>> 	while (num_bytes > 0) {
>> -		cur_bytes = min_t(u64, num_bytes, SZ_256M);
>> +		cur_bytes = min_t(u64, num_bytes, SZ_1G);
>> 		cur_bytes = max(cur_bytes, min_size);
>> 		/*
>> 		 * If we are severely fragmented we could end up with really
>> --
>> 2.37.2
>
>
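Qu's point, that fiemap can report many small on-disk extents as one entry when they are contiguous both logically and physically, can be illustrated with a small sketch. This is an illustration only, not btrfs code; the function name and sample numbers are hypothetical:

```python
def merge_fiemap(extents):
    """Merge (logical, physical, length) extents, given in logical order,
    whenever the next extent continues the previous one both logically
    and physically -- mimicking how fiemap emission can coalesce them."""
    merged = []
    for logical, physical, length in extents:
        if merged:
            plog, pphys, plen = merged[-1]
            if plog + plen == logical and pphys + plen == physical:
                # Contiguous on both axes: extend the previous entry.
                merged[-1] = (plog, pphys, plen + length)
                continue
        merged.append((logical, physical, length))
    return merged

# 1024 separate 4K extents written back-to-back on disk (as produced by
# the dd with oflag=sync above) collapse into a single reported extent.
extents = [(i * 4096, 10_000_000 + i * 4096, 4096) for i in range(1024)]
print(len(merge_fiemap(extents)))  # 1
```

So a single fiemap entry does not prove a single on-disk extent; dump tree is the authoritative view.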
On 2022/10/7 11:29, Qu Wenruo wrote:
>
>
> On 2022/10/7 11:23, Wang Yugui wrote:
>> Hi,
>>
>>> When doing a large fallocate, btrfs will break it up into 256MiB
>>> extents. Our data block groups are 1GiB, so a more natural maximum size
>>> is 1GiB, so that we tend to allocate and fully use block groups rather
>>> than fragmenting the file around.
>>>
>>> This is especially useful if large fallocates tend to be for "round"
>>> amounts, which strikes me as a reasonable assumption.
>>>
>>> While moving to size classes reduces the value of this change, it is
>>> also good to compare potential allocator algorithms against just 1G
>>> extents.
>>
>> I wrote a 32G file and compared the results of 'xfs_io -c fiemap'.
>>
>> dd conv=fsync bs=1024K count=32K if=/dev/zero of=/mnt/test/dd.txt
>>
>> When writing to a btrfs filesystem:
>> + xfs_io -c fiemap /mnt/test/dd.txt
>> /mnt/test/dd.txt:
>> 	0: [0..262143]: 6883584..7145727
>> 	1: [262144..524287]: 6367232..6629375
>> 	2: [524288..8126463]: 7145728..14747903
>> 	3: [8126464..8388607]: 15272064..15534207
>> 	4: [8388608..8650751]: 14755840..15017983
>> 	5: [8650752..16252927]: 15534208..23136383
>> 	6: [16252928..67108863]: 23144448..74000383
>>
>> When writing to an xfs filesystem:
>> + xfs_io -c fiemap /mnt/test/dd.txt
>> /mnt/test/dd.txt:
>> 	0: [0..16465919]: 256..16466175
>> 	1: [16465920..31821623]: 16466176..31821879
>> 	2: [31821624..41942903]: 31821880..41943159
>> 	3: [41942904..58720111]: 47183872..63961079
>> 	4: [58720112..67108863]: 63961080..72349831
>
> Don't waste your time on fiemap output. Btrfs can merge the fiemap
> result at write time.

s/write/fiemap emit/

>
> # dd if=/dev/zero bs=4k count=1024 oflag=sync of=/mnt/btrfs/file
>
> The above command should result in all file extents being 4K, which can
> be easily verified through dump tree.
>
>   item 197 key (258 EXTENT_DATA 4186112) itemoff 5789 itemsize 53
> 	generation 8 type 1 (regular)
> 	extent data disk byte 22011904 nr 4096
> 	extent data offset 0 nr 4096 ram 4096
>
> But fiemap will only report one single extent:
>
> # xfs_io -c "fiemap -v" /mnt/btrfs/file
> /mnt/btrfs/file:
>  EXT: FILE-OFFSET      BLOCK-RANGE  TOTAL FLAGS
>    0: [0..8191]:       34816..43007  8192   0x1
>
>>
>> The largest extent on xfs is about 8G, but the largest on btrfs is
>> about 25G ('6: [16252928..67108863]')?
>>
>> Best Regards
>> Wang Yugui (wangyugui@e16-tech.com)
>> 2022/10/07
>>
>>
>>
>>> Signed-off-by: Boris Burkov <boris@bur.io>
>>> ---
>>>  fs/btrfs/inode.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>>> index 45ebef8d3ea8..fd66586ae2fc 100644
>>> --- a/fs/btrfs/inode.c
>>> +++ b/fs/btrfs/inode.c
>>> @@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
>>> 	if (trans)
>>> 		own_trans = false;
>>> 	while (num_bytes > 0) {
>>> -		cur_bytes = min_t(u64, num_bytes, SZ_256M);
>>> +		cur_bytes = min_t(u64, num_bytes, SZ_1G);
>>> 		cur_bytes = max(cur_bytes, min_size);
>>> 		/*
>>> 		 * If we are severely fragmented we could end up with really
>>> --
>>> 2.37.2
>>
>>
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 45ebef8d3ea8..fd66586ae2fc 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9884,7 +9884,7 @@ static int __btrfs_prealloc_file_range(struct inode *inode, int mode,
 	if (trans)
 		own_trans = false;
 	while (num_bytes > 0) {
-		cur_bytes = min_t(u64, num_bytes, SZ_256M);
+		cur_bytes = min_t(u64, num_bytes, SZ_1G);
 		cur_bytes = max(cur_bytes, min_size);
 		/*
 		 * If we are severely fragmented we could end up with really
When doing a large fallocate, btrfs will break it up into 256MiB
extents. Our data block groups are 1GiB, so a more natural maximum size
is 1GiB, so that we tend to allocate and fully use block groups rather
than fragmenting the file around.

This is especially useful if large fallocates tend to be for "round"
amounts, which strikes me as a reasonable assumption.

While moving to size classes reduces the value of this change, it is
also good to compare potential allocator algorithms against just 1G
extents.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)