Message ID: 20190522180546.17063-1-bfoster@redhat.com
Series: xfs: rework extent allocation
On Wed, May 22, 2019 at 02:05:35PM -0400, Brian Foster wrote: > Hi all, > > This is v2 of the extent allocation rework series. The changes in this > version are mostly associated with code factoring, based on feedback to > v1. The small mode helper refactoring has been isolated and pulled to > the start of the series. The active flag that necessitated the btree > cursor container structure has been pushed into the xfs_btree_cur > private area. The resulting high level allocation code in > xfs_ag_alloc_vextent() has been cleaned up to remove an unnecessary > level of abstraction. Finally, there are various minor cleanups and > fixes. > > On the testing front, I've run a couple more filebench oriented tests > since v1. The first is a high load, large filesystem, parallel file > write+fsync test to try and determine whether the modified near mode > allocation algorithm resulted in larger latencies in the common > (non-fragmented) case. The results show comparable latencies, though the > updated algorithm has a slightly faster overall runtime for whatever > reason. Probably indicative that over so many allocations, saving a few microseconds of CPU time here and there adds up. That's also a fairly good indication that the IO behaviour hasn't dramatically changed between algorithms - we're not adding or removing a huge number of seeks to the workload.... > The second is another filebench test (but with a smaller fileset against > a smaller filesystem), but with the purpose of measuring "locality > effectiveness" of the updated algorithm via post-test analysis of the > resulting/populated filesystem. I've been thinking a bit about how to > test for locality since starting on this series and ultimately came up > with the following, fairly crude heuristic: track and compare the worst > locality allocation for each regular file inode in the fs. OK, that's pretty crude :P > This > essentially locates the most distant extent for each inode, tracks the > delta from that extent to the inode location on disk and calculates the > average worst case delta across the entire set of regular files. The > results show that the updated algorithm provides a comparable level of > locality to the existing algorithm. The problem with this is that worst case locality isn't a particularly useful measure. In general, when you have allocator contention it occurs on the AGF locks and so the allocator skips to the next AG it can lock. That means if we have 32 AGs and 33 allocations in progress at once, the AG that is chosen for allocation is going to be essentially random. This means worst case allocation locality is always going to be "ag skip" distances and so the jumps between AGs are going to largely dominate the measured locality distances. In this case, 7TB, 32AGs = ~220GB per AG, so an AG skip will be around 220 * 2^30 / 2^9 = ~460m sectors and: > - baseline - min: 8 max: 568752250 median: 434794.5 mean: 11446328 > - test - min: 33 max: 568402234 median: 437405.5 mean: 11752963 > - by-size only - min: 33 max: 568593146 median: 784805 mean: 11912300 max are all >460m sectors and so are AG skip distances. However, the changes you've made affect locality for allocations _within_ an AG, not across the filesystem, and so anything that skips to another AG really needs to be measured differently. i.e. what we really need to measure here is "how close to target did we get?" and for extending writes the target is always the AGBNO of the end of the last extent.
The inode itself is only used as the target for the first extent, so using it as the only distance comparison ignores the fact we try to allocate as close to the end of the last extent as possible, not as close to the inode as possible. Hence once a file has jumped AG, it will stay in the new AG and not return to the original AG the inode is in. This means that once the file data changes locality, it tries to keep that same locality for the next data that is written, not force another seek back to the original location. So, AFAICT, the measure of locality we should be using to evaluate the impact to locality of the new algorithm is the distance between sequential extents in a file allocated within the same AG, not the worst case distance from the inode.... Cheers, Dave. (*) Which, in reality, we really should reset because once we jump AG we have no locality target and so should allow the full AG to be considered. This "didn't reset target" issue is something I suspect leads to the infamous "backwards allocation for sequential writes" problems...
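[Illustrative sketch (not part of the original thread): the per-extent measure Dave describes - the distance from the end of one extent to the start of the next, counted only when both extents sit in the same AG, with AG jumps noted separately. The struct here is an invented stand-in for real bmbt records; an actual implementation would walk the inode BMBTs from xfs_db.]

    #include <stdint.h>
    #include <stdio.h>

    struct ext {
            uint32_t agno;          /* allocation group number */
            uint64_t agbno;         /* AG block at which the extent starts */
            uint64_t len;           /* extent length in blocks */
    };

    /*
     * Emit one {extent size, distance from end of previous extent} tuple
     * per logically adjacent extent pair.  AG jumps carry no meaningful
     * distance, so they are reported separately.
     */
    static void locality_deltas(const struct ext *e, int nextents)
    {
            int i;

            for (i = 1; i < nextents; i++) {
                    uint64_t target, delta;

                    if (e[i].agno != e[i - 1].agno) {
                            printf("ag skip\n");
                            continue;
                    }
                    target = e[i - 1].agbno + e[i - 1].len;
                    delta = e[i].agbno > target ? e[i].agbno - target
                                                : target - e[i].agbno;
                    printf("len %llu delta %llu\n",
                           (unsigned long long)e[i].len,
                           (unsigned long long)delta);
            }
    }

    int main(void)
    {
            struct ext e[] = {
                    { 0, 100, 10 },         /* ends at agbno 110 */
                    { 0, 130, 20 },         /* same AG: delta 20 */
                    { 1, 50, 10 },          /* AG jump */
            };

            locality_deltas(e, 3);
            return 0;
    }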
On Thu, May 23, 2019 at 11:56:59AM +1000, Dave Chinner wrote: > On Wed, May 22, 2019 at 02:05:35PM -0400, Brian Foster wrote: > > Hi all, > > > > This is v2 of the extent allocation rework series. The changes in this > > version are mostly associated with code factoring, based on feedback to > > v1. The small mode helper refactoring has been isolated and pulled to > > the start of the series. The active flag that necessitated the btree > > cursor container structure has been pushed into the xfs_btree_cur > > private area. The resulting high level allocation code in > > xfs_ag_alloc_vextent() has been cleaned up to remove an unnecessary > > level of abstraction. Finally, there are various minor cleanups and > > fixes. > > > > On the testing front, I've run a couple more filebench oriented tests > > since v1. The first is a high load, large filesystem, parallel file > > write+fsync test to try and determine whether the modified near mode > > allocation algorithm resulted in larger latencies in the common > > (non-fragmented) case. The results show comparable latencies, though the > > updated algorithm has a slightly faster overall runtime for whatever > > reason. > > Probably indicative that over so many allocations, saving a few > microseconds of CPU time here and there adds up. That's also a fairly > good indication that the IO behaviour hasn't dramatically changed > between algorithms - we're not adding or removing a huge number of > seeks to the workload.... > Makes sense. The goal here (and purpose for the higher level testing) is basically to confirm this doesn't break/regress the common (non-fragmented) allocation scenario. I suppose that's a bonus if we find some incremental speedups along the way.. > > The second is another filebench test (but with a smaller fileset against > > a smaller filesystem), but with the purpose of measuring "locality > > effectiveness" of the updated algorithm via post-test analysis of the > > resulting/populated filesystem. I've been thinking a bit about how to > > test for locality since starting on this series and ultimately came up > > with the following, fairly crude heuristic: track and compare the worst > > locality allocation for each regular file inode in the fs. > > OK, that's pretty crude :P > Yeah, this was just a start and I figured it might generate some feedback... ;) > > This > > essentially locates the most distant extent for each inode, tracks the > > delta from that extent to the inode location on disk and calculates the > > average worst case delta across the entire set of regular files. The > > results show that the updated algorithm provides a comparable level of > > locality to the existing algorithm. > > The problem with this is that worst case locality isn't a > particularly useful measure. In general, when you have allocator > contention it occurs on the AGF locks and so the allocator skips to > the next AG it can lock. That means if we have 32 AGs and 33 > allocations in progress at once, the AG that is chosen for > allocation is going to be essentially random. This means worst case > allocation locality is always going to be "ag skip" distances and so > the jumps between AGs are going to largely dominate the measured > locality distances. > Good point. I was thinking about rerunning this with agcount=1 (with a lighter workload) to isolate the analysis to a single AG, but wanted to get this on the list for feedback given the time it takes to populate the fs and whatnot.
> In this case, 7TB, 32AGs = ~220GB per AG, so an AG skip will be > around 220 * 2^30 / 2^9 = ~460m sectors and: > > > - baseline - min: 8 max: 568752250 median: 434794.5 mean: 11446328 > > - test - min: 33 max: 568402234 median: 437405.5 mean: 11752963 > > - by-size only - min: 33 max: 568593146 median: 784805 mean: 11912300 > > max are all >460m sectors and so are AG skip distances. > Though it is interesting that the average and median are within that ~460m sector delta. I think that means this is at least catching some information on intra-AG locality as many of these files might not have had to jump AGs. Taking a look at the dataset, this could be because the majority of these files are on the smaller side and consist of one or two extents. There's also plenty of RAM (64GB) on the box I happened to use. For reference, the file sizes in the set are defined by the following filebench mapping: {{50, 4k, 1m}, {35, 10m, 100m}, {10, 500m, 1g}, {5, 1g, 8g} } ... where the first value is the percentage and the next two are a size range. E.g., 50% of files are 4k-1m, 35% are 10m-100m, etc. I can obviously tweak this however necessary to provide most useful results or test different distributions. But while I don't think the current number is completely bogus, I agree that it's diluted by those larger files that do happen to jump AGs and thus it is probably missing critical information about file extension allocation locality. > However, the changes you've made affect locality for allocations > _within_ an AG, not across the filesystem, and so anything that > skips to another AG really needs to be measured differently. > > i.e. what we really need to measure here is "how close to target did > we get?" and for extending writes the target is always the AGBNO of > the end of the last extent. > > The inode itself is only used as the target for the first extent, so > using it as the only distance comparison ignores the fact we try to > allocate as close to the end of the last extent as possible, not as > close to the inode as possible. Hence once a file has jumped AG, it > will stay in the new AG and not return to the original AG the inode > is in. This means that once the file data changes locality, it tries > to keep that same locality for the next data that is written, not > force another seek back to the original location. > Ok, I was thinking that locality is always based on inode location. I see that xfs_bmap_btalloc() assigns ap->blkno (which makes its way to args.fsbno/agbno) based on the inode, but I missed the xfs_bmap_adjacent() call right after that which overrides ap->blkno based on the previous extent, if applicable. So this means that the worst case locality based on inode location is actually invalid for (sequentially written) multi-extent files because once we create a discontiguity, we're not measuring against the locality target that was actually used by the algorithm (even within the same AG). > So, AFAICT, the measure of locality we should be using to evaluate > the impact to locality of the new algorithm is the distance between > sequential extents in a file allocated within the same AG, not the > worst case distance from the inode.... > Yeah, makes sense. I created metadumps of each filesystem created for this test in anticipation of tweaks to the post-processing heuristic. I assume we still want to fold this measurement up into some mean/median locality value for the overall filesystem for comparison purposes, but how would you prefer to see that tracked on a per-inode basis? 
I could still track the worst case stride from one extent to the next within an inode (provided they sit in the same AG), or we could do something like track the average stride for each inode, or average stride in general across all intra-AG extents. Hmmm.. I suppose if I had a script that just dumped every applicable stride/delta value for an inode, I could dump all of those numbers into a file and we can process it from there.. I'll play around with this some more. Thanks for the feedback! > Cheers, > > Dave. > > (*) Which, in reality, we really should reset because once we jump > AG we have no locality target and so should allow the full AG to be > considered. This "didn't reset target" issue is something I suspect > leads to the infamous "backwards allocation for sequential writes" > problems... > I think this is something that's handled at a higher level. In the nullfb case at least, we use the XFS_ALLOCTYPE_START_BNO allocation mode which is what allows us to iterate AGs. We start with a near mode allocation in the target AG. If that fails, xfs_alloc_vextent() switches over to XFS_ALLOCTYPE_THIS_AG for the remaining AGs. This lands in xfs_alloc_ag_vextent_size() which ignores args.agbno even if it's set. Brian > -- > Dave Chinner > david@fromorbit.com
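[Illustrative sketch (not part of the original thread): a simplified model of the XFS_ALLOCTYPE_START_BNO flow Brian describes - one locality-aware attempt in the target AG, then by-size attempts that ignore the agbno target while wrapping through the remaining AGs. This is not the actual xfs_alloc_vextent() code; the struct and helper names are invented stubs.]

    #include <errno.h>
    #include <stdio.h>

    struct args {
            int target_agno;        /* AG derived from the fsbno target */
            /* length, alignment, agbno target, etc. elided */
    };

    /* stand-ins for the real allocators; return 0 on success */
    static int near_bno_alloc(struct args *args, int agno)
    {
            (void)args; (void)agno;
            return -1;
    }

    static int this_ag_size_alloc(struct args *args, int agno)
    {
            (void)args; (void)agno;
            return -1;
    }

    static int start_bno_alloc(struct args *args, int agcount)
    {
            int agno = args->target_agno;

            /* only the first attempt honours locality within the AG */
            if (near_bno_alloc(args, agno) == 0)
                    return 0;

            /* remaining AGs: pure by-size allocation, no agbno target */
            for (;;) {
                    agno = (agno + 1) % agcount;
                    if (agno == args->target_agno)
                            return -ENOSPC; /* wrapped through all AGs */
                    if (this_ag_size_alloc(args, agno) == 0)
                            return 0;
            }
    }

    int main(void)
    {
            struct args args = { .target_agno = 2 };

            /* with the stub allocators this reports -ENOSPC */
            printf("%d\n", start_bno_alloc(&args, 8));
            return 0;
    }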
On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote: > On Thu, May 23, 2019 at 11:56:59AM +1000, Dave Chinner wrote: > > On Wed, May 22, 2019 at 02:05:35PM -0400, Brian Foster wrote: > > > Hi all, > > > > > > This is v2 of the extent allocation rework series. The changes in this > > > version are mostly associated with code factoring, based on feedback to > > > v1. The small mode helper refactoring has been isolated and pulled to > > > the start of the series. The active flag that necessitated the btree > > > cursor container structure has been pushed into the xfs_btree_cur > > > private area. The resulting high level allocation code in > > > xfs_ag_alloc_vextent() has been cleaned up to remove an unnecessary > > > level of abstraction. Finally, there are various minor cleanups and > > > fixes. > > > > > > On the testing front, I've run a couple more filebench oriented tests > > > since v1. The first is a high load, large filesystem, parallel file > > > write+fsync test to try and determine whether the modified near mode > > > allocation algorithm resulted in larger latencies in the common > > > (non-fragmented) case. The results show comparable latencies, though the > > > updated algorithm has a slightly faster overall runtime for whatever > > > reason. > > > > Probably indicative that over so many allocations, saving a few > > microseconds of CPU time here and there adds up. That's also a fairly > > good indication that the IO behaviour hasn't dramatically changed > > between algorithms - we're not adding or removing a huge number of > > seeks to the workload.... > > > > Makes sense. The goal here (and purpose for the higher level testing) is > basically to confirm this doesn't break/regress the common > (non-fragmented) allocation scenario. I suppose that's a bonus if we > find some incremental speedups along the way.. *nod* [snip] > > In this case, 7TB, 32AGs = ~220GB per AG, so an AG skip will be > > around 220 * 2^30 / 2^9 = ~460m sectors and: > > > > > - baseline - min: 8 max: 568752250 median: 434794.5 mean: 11446328 > > > - test - min: 33 max: 568402234 median: 437405.5 mean: 11752963 > > > - by-size only - min: 33 max: 568593146 median: 784805 mean: 11912300 > > > > max are all >460m sectors and so are AG skip distances. > > > > Though it is interesting that the average and median are within that > ~460m sector delta. I think that means this is at least catching some > information on intra-AG locality as many of these files might not have > had to jump AGs. [....] > But while I don't think the current number is completely bogus, I agree > that it's diluted by those larger files that do happen to jump AGs and > thus it is probably missing critical information about file extension > allocation locality. *nod* [....] > > So, AFAICT, the measure of locality we should be using to evaluate > > the impact to locality of the new algorithm is the distance between > > sequential extents in a file allocated within the same AG, not the > > worst case distance from the inode.... > > > > Yeah, makes sense. I created metadumps of each filesystem created for > this test in anticipation of tweaks to the post-processing heuristic. Nice :) > I assume we still want to fold this measurement up into some mean/median > locality value for the overall filesystem for comparison purposes, but > how would you prefer to see that tracked on a per-inode basis? 
I could > still track the worst case stride from one extent to the next within an > inode (provided they sit in the same AG), or we could do something like > track the average stride for each inode, or average stride in general > across all intra-AG extents. Histograms are probably the best way to demonstrate the general distribution of the data set. Average/median don't really convey much other than whether the dataset is skewed to one end or another, while histograms will give us a good indication of the change in "distance from target" over a wide range of the filesystem. It can even incorporate "skipped AG" as a bucket to indicate how often that is occurring and whether the change in algorithms increases/decreases the occurrence of that. FWIW, I suspect a histogram that describes "extent size vs locality of next extent" will give us an idea of whether small or large allocations have worse locality. I'd also expect a measure like this to give insight into how badly free space fragmentation is affecting the filesystem, as it will tend to push the extent size distribution towards the smaller end of the scale.... > Hmmm.. I suppose if I had a script that > just dumped every applicable stride/delta value for an inode, I could > dump all of those numbers into a file and we can process it from there.. See how the freesp commands work in xfs_db - they just generate a set of {offset, size} tuples that are then bucketted appropriately. This is probably the best way to do this at the moment - have xfs_db walk the inode BMBTs outputting something like {extent size, distance to next extent} tuples and everything else falls out from how we bucket that information. > > (*) Which, in reality, we really should reset because once we jump > > AG we have no locality target and so should allow the full AG to be > > considered. This "didn't reset target" issue is something I suspect > > leads to the infamous "backwards allocation for sequential writes" > > problems... > > I think this is something that's handled at a higher level. In the > nullfb case at least, we use the XFS_ALLOCTYPE_START_BNO allocation mode > which is what allows us to iterate AGs. The nullfb case isn't the problem here - I think the problem comes when we fail the THIS_BNO/NEAR_BNO allocation attempts in xfs_bmap_btalloc() and then fall back to XFS_ALLOCTYPE_START_BNO with the same fsbno target. > We start with a near mode > allocation in the target AG. If that fails, xfs_alloc_vextent() switches > over to XFS_ALLOCTYPE_THIS_AG for the remaining AGs. This lands in > xfs_alloc_ag_vextent_size() which ignores args.agbno even if it's set. Hmmm. I think you just pointed out a bug in this code. The initial NEARBNO alloc sets ->min_agbno and ->max_agbno based on the args->agbno, which xfs_alloc_compute_aligned then uses to determine if the found free extent can be used. When we immediately fall back to XFS_ALLOCTYPE_THIS_AG allocation, we don't reset min_agbno/max_agbno, and so even though we ignore the args->agbno target that is set when selecting a free extent, we still call xfs_alloc_compute_aligned() to determine if it is appropriate to use. I think the bug here is that xfs_alloc_compute_aligned() applies the NEARBNO args->min_agbno to the free extent found by the THIS_AG allocation, and so if the large free extent in the new AG overlaps min_agbno it selects the higher up part of the free extent that starts at min_agbno, rather than using the start of the free extent...
Of course, I could be missing something obvious (still haven't finished my morning coffee!), so it's worth checking I read the code properly.... Cheers, Dave.
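[Illustrative sketch (not part of the original thread): the freesp-style bucketing Dave suggests - {extent size, distance to next extent} tuples accumulated into power-of-two buckets, with a separate bucket for AG skips. The bucket count and output format are arbitrary choices for illustration.]

    #include <stdint.h>
    #include <stdio.h>

    #define NBUCKETS        40
    #define AGSKIP          (NBUCKETS - 1)  /* distance is meaningless here */

    static uint64_t counts[NBUCKETS];
    static uint64_t lensums[NBUCKETS];

    /* power-of-two bucket index: bucket b covers deltas in [2^b, 2^(b+1)) */
    static int bucket(uint64_t delta)
    {
            int b = 0;

            while (delta >>= 1)
                    b++;
            return b < AGSKIP ? b : AGSKIP - 1;
    }

    static void account(uint64_t extlen, uint64_t delta, int ag_skip)
    {
            int b = ag_skip ? AGSKIP : bucket(delta);

            counts[b]++;
            lensums[b] += extlen;
    }

    static void report(uint64_t total)
    {
            int b;

            for (b = 0; b < NBUCKETS; b++) {
                    if (!counts[b])
                            continue;
                    /* the AGSKIP bucket prints a 0..0 range */
                    printf("%12llu %12llu %8llu avgsz %8llu pct %5.2f\n",
                           b == AGSKIP ? 0 : 1ULL << b,
                           b == AGSKIP ? 0 : (2ULL << b) - 1,
                           (unsigned long long)counts[b],
                           (unsigned long long)(lensums[b] / counts[b]),
                           100.0 * counts[b] / total);
            }
    }

    int main(void)
    {
            account(100, 300, 0);   /* 100-block extent, next one 300 away */
            account(50, 0, 1);      /* successor jumped to another AG */
            report(2);
            return 0;
    }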
On Fri, May 24, 2019 at 08:15:52AM +1000, Dave Chinner wrote: > On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote: > > On Thu, May 23, 2019 at 11:56:59AM +1000, Dave Chinner wrote: > > > On Wed, May 22, 2019 at 02:05:35PM -0400, Brian Foster wrote: > > > > Hi all, > > > > > > > > This is v2 of the extent allocation rework series. The changes in this > > > > version are mostly associated with code factoring, based on feedback to > > > > v1. The small mode helper refactoring has been isolated and pulled to > > > > the start of the series. The active flag that necessitated the btree > > > > cursor container structure has been pushed into the xfs_btree_cur > > > > private area. The resulting high level allocation code in > > > > xfs_ag_alloc_vextent() has been cleaned up to remove an unnecessary > > > > level of abstraction. Finally, there are various minor cleanups and > > > > fixes. > > > > > > > > On the testing front, I've run a couple more filebench oriented tests > > > > since v1. The first is a high load, large filesystem, parallel file > > > > write+fsync test to try and determine whether the modified near mode > > > > allocation algorithm resulted in larger latencies in the common > > > > (non-fragmented) case. The results show comparable latencies, though the > > > > updated algorithm has a slightly faster overall runtime for whatever > > > > reason. > > > > > > Probably indicative that over so many allocations, saving a few > > > microseconds of CPU time here and there adds up. That's also a fairly > > > good indication that the IO behaviour hasn't dramatically changed > > > between algorithms - we're not adding or removing a huge number of > > > seeks to the workload.... > > > > > > > Makes sense. The goal here (and purpose for the higher level testing) is > > basically to confirm this doesn't break/regress the common > > (non-fragmented) allocation scenario. I suppose that's a bonus if we > > find some incremental speedups along the way.. > > *nod* > > [snip] > > > > > In this case, 7TB, 32AGs = ~220GB per AG, so an AG skip will be > > > around 220 * 2^30 / 2^9 = ~460m sectors and: > > > > > > > - baseline - min: 8 max: 568752250 median: 434794.5 mean: 11446328 > > > > - test - min: 33 max: 568402234 median: 437405.5 mean: 11752963 > > > > - by-size only - min: 33 max: 568593146 median: 784805 mean: 11912300 > > > > > > max are all >460m sectors and so are AG skip distances. > > > > > > > Though it is interesting that the average and median are within that > > ~460m sector delta. I think that means this is at least catching some > > information on intra-AG locality as many of these files might not have > > had to jump AGs. > > [....] > > > But while I don't think the current number is completely bogus, I agree > > that it's diluted by those larger files that do happen to jump AGs and > > thus it is probably missing critical information about file extension > > allocation locality. > > *nod* > > [....] > > > > So, AFAICT, the measure of locality we should be using to evaluate > > > the impact to locality of the new algorithm is the distance between > > > sequential extents in a file allocated within the same AG, not the > > > worst case distance from the inode.... > > > > > > > Yeah, makes sense. I created metadumps of each filesystem created for > > this test in anticipation of tweaks to the post-processing heuristic. 
> > Nice :) > > > I assume we still want to fold this measurement up into some mean/median > > locality value for the overall filesystem for comparison purposes, but > > how would you prefer to see that tracked on a per-inode basis? I could > > still track the worst case stride from one extent to the next within an > > inode (provided they sit in the same AG), or we could do something like > > track the average stride for each inode, or average stride in general > > across all intra-AG extents. > > Histograms are probably the best way to demonstrate the general > distribution of the data set. Average/median don't really convey > much other than whether the dataset is skewed to one end or another, > while histograms will give us a good indication of the change in > "distance from target" over a wide range of the filesystem. It can > even incorporate "skipped AG" as a bucket to indicate how often > that is occurring and whether the change in algorithms > increases/decreases the occurrence of that. > > FWIW, I suspect a histogram that describes "extent size vs locality > of next extent" will give us an idea of whether small or large > allocations have worse locality. I'd also expect a measure like this > to give insight into how badly free space fragmentation is affecting > the filesystem, as it will tend to push the extent size distribution > towards the smaller end of the scale.... > Ok. I actually attempted a histogram of the current dataset but the random CLI utility I found didn't quite spit out what I expected. I was probably using it wrong, but didn't dig much further into it.. > > Hmmm.. I suppose if I had a script that > > just dumped every applicable stride/delta value for an inode, I could > > dump all of those numbers into a file and we can process it from there.. > > See how the freesp commands work in xfs_db - they just generate a > set of {offset, size} tuples that are then bucketted appropriately. > This is probably the best way to do this at the moment - have xfs_db > walk the inode BMBTs outputting something like {extent size, > distance to next extent} tuples and everything else falls out from > how we bucket that information. > That sounds plausible. A bit more involved than what I'm currently working with, but we do already have a blueprint for the scanning implementation required to collect this data via the frag command. Perhaps some of this code between the frag/freesp can be generalized and reused. I'll take a closer look at it. My only concern is I'd prefer to only go down this path as long as we plan to land the associated command in xfs_db. So this approach suggests to me that we add a "locality" command similar to frag/freesp that presents the locality state of the fs. For now I'm only really concerned with the data associated with known near mode allocations (i.e., such as the extent size:distance relationship you've outlined above) so we can evaluate these algorithmic changes, but this would be for fs devs only so we could always expand on it down the road if we want to assess different allocations. Hm? > > > > (*) Which, in reality, we really should reset because once we jump > > > AG we have no locality target and so should allow the full AG to be > > > considered. This "didn't reset target" issue is something I suspect > > > leads to the infamous "backwards allocation for sequential writes" > > > problems... > > > > I think this is something that's handled at a higher level.
In the > > nullfb case at least, we use the XFS_ALLOCTYPE_START_BNO allocation mode > > which is what allows us to iterate AGs. > > The nullfb case isn't the problem here - I think the problem comes > when we fail the THIS_BNO/NEAR_BNO allocation attempts in > xfs_bmap_btalloc() and then fall back to XFS_ALLOCTYPE_START_BNO > with the same fsbno target. > Ok, that's the retry where we restore minlen from maxlen back to 1. Note that still depends on nullfb, but that's just context. I think the majority of the time data allocations come into the allocator as the first allocation in the transaction. That aside, I've also seen that this particular retry is hotter than you'd think just by glancing at the code due to the whole minlen = maxlen thing in xfs_bmap_select_minlen() combined with potential size of delalloc extents. See the patches I posted a couple weeks ago in this area for example: https://marc.info/?l=linux-xfs&m=155671950608062&w=2 ... though I don't think they relate to the issue you're tracking here. > > We start with a near mode > > allocation in the target AG. If that fails, xfs_alloc_vextent() switches > > over to XFS_ALLOCTYPE_THIS_AG for the remaining AGs. This lands in > > xfs_alloc_ag_vextent_size() which ignores args.agbno even if it's set. > > Hmmm. I think you just pointed out a bug in this code. The initial > NEARBNO alloc sets ->min_agbno and ->max_agbno based on the > args->agbno, which xfs_alloc_compute_aligned then uses to determine > if the found free extent can be used. When we immediately fall back > to XFS_ALLOCTYPE_THIS_AG allocation, we don't reset > min_agbno/max_agbno, and so even though we ignore the args->agbno > target that is set when selecting a free extent, we still call > xfs_alloc_compute_aligned() to determine if it is appropriate to > use. > I don't see where the near mode alloc assigns ->[min|max]_agbno. The only place I see ->min_agbno used is for sparse inode chunk allocation. ->max_agbno is assigned in the near mode alloc when min == 0, but it's set to agsize so I'm guessing that this is just an initialization so the checking logic works with zeroed out values from the caller. Just to clarify.. what exactly is the reverse allocation problem you're thinking about here? Logically increasing writes resulting in physically decreasing allocations? If so, have we confirmed that writes are indeed sequential when this occurs (obvious I guess, but worth asking). I'm not really familiar with this problem so I've not thought about it, but one random thought comes to mind: is there any chance of allocation interlocking behind extent frees contributing to this behavior? We do truncate files in reverse order an extent at a time and immediately finish the dfops (to free the extent and busy it) before we unmap the next piece. If allocations could interlock with such frees (or busy clearing), I could see in theory how that might result in backwards allocations, but I'm just handwaving and not sure that's possible in practice. Brian > I think the bug here is that xfs_alloc_compute_aligned() applies the > NEARBNO args->min_agbno to the free extent found by the THIS_AG > allocation, and so if the large free extent in the new AG overlaps > min_agbno it selects the higher up part of the free extent that > starts at min_agbno, rather than using the start of the free > extent... > > Of course, I could be missing something obvious (still haven't > finished my morning coffee!), so it's worth checking I read the code > properly.... > > Cheers, > > Dave. 
> -- > Dave Chinner > david@fromorbit.com
On Fri, May 24, 2019 at 08:00:18AM -0400, Brian Foster wrote: > On Fri, May 24, 2019 at 08:15:52AM +1000, Dave Chinner wrote: > > On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote: > > > Hmmm.. I suppose if I had a script that > > > just dumped every applicable stride/delta value for an inode, I could > > > dump all of those numbers into a file and we can process it from there.. > > > > See how the freesp commands work in xfs_db - they just generate a > > set of {offset, size} tuples that are then bucketted appropriately. > > This is probably the best way to do this at the moment - have xfs_db > > walk the inode BMBTs outputting something like {extent size, > > distance to next extent} tuples and everything else falls out from > > how we bucket that information. > > > > That sounds plausible. A bit more involved than what I'm currently > working with, but we do already have a blueprint for the scanning > implementation required to collect this data via the frag command. > Perhaps some of this code between the frag/freesp can be generalized and > reused. I'll take a closer look at it. > > My only concern is I'd prefer to only go down this path as long as we > plan to land the associated command in xfs_db. So this approach suggests > to me that we add a "locality" command similar to frag/freesp that > presents the locality state of the fs. For now I'm only really concerned > with the data associated with known near mode allocations (i.e., such as > the extent size:distance relationship you've outlined above) so we can > evaluate these algorithmic changes, but this would be for fs devs only > so we could always expand on it down the road if we want to assess > different allocations. Hm? Yup, I'm needing to do similar analysis myself to determine how quickly I'm aging the filesystem, so having the tool in xfs_db or xfs_spaceman would be very useful. FWIW, the tool I've just started writing will just use fallocate and truncate to hammer the allocation code as hard and as quickly as possible - I want to do accelerated aging of the filesystem, and so being able to run tens to hundreds of thousands of free space manipulations a second is the goal here.... > > > > (*) Which, in reality, we really should reset because once we jump > > > > AG we have no locality target and so should allow the full AG to be > > > > considered. This "didn't reset target" issue is something I suspect > > > > leads to the infamous "backwards allocation for sequential writes" > > > > problems... > > > > > > I think this is something that's handled at a higher level. In the > > > nullfb case at least, we use the XFS_ALLOCTYPE_START_BNO allocation mode > > > which is what allows us to iterate AGs. > > > > The nullfb case isn't the problem here - I think the problem comes > > when we fail the THIS_BNO/NEAR_BNO allocation attempts in > > xfs_bmap_btalloc() and then fall back to XFS_ALLOCTYPE_START_BNO > > with the same fsbno target. > > > > Ok, that's the retry where we restore minlen from maxlen back to 1. Note > that still depends on nullfb, but that's just context. I think the > majority of the time data allocations come into the allocator as the > first allocation in the transaction. > > That aside, I've also seen that this particular retry is hotter than > you'd think just by glancing at the code due to the whole minlen = > maxlen thing in xfs_bmap_select_minlen() combined with potential size of > delalloc extents. 
See the patches I posted a couple weeks ago in this > area for example: > > https://marc.info/?l=linux-xfs&m=155671950608062&w=2 > > ... though I don't think they relate to the issue you're tracking here. > > > We start with a near mode > > allocation in the target AG. If that fails, xfs_alloc_vextent() switches > > over to XFS_ALLOCTYPE_THIS_AG for the remaining AGs. This lands in > > xfs_alloc_ag_vextent_size() which ignores args.agbno even if it's set. > > Hmmm. I think you just pointed out a bug in this code. The initial > NEARBNO alloc sets ->min_agbno and ->max_agbno based on the > args->agbno, which xfs_alloc_compute_aligned then uses to determine > if the found free extent can be used. When we immediately fall back > to XFS_ALLOCTYPE_THIS_AG allocation, we don't reset > min_agbno/max_agbno, and so even though we ignore the args->agbno > target that is set when selecting a free extent, we still call > xfs_alloc_compute_aligned() to determine if it is appropriate to > use. > > I don't see where the near mode alloc assigns ->[min|max]_agbno. The > only place I see ->min_agbno used is for sparse inode chunk allocation. > ->max_agbno is assigned in the near mode alloc when min == 0, but it's > set to agsize so I'm guessing that this is just an initialization so the > checking logic works with zeroed out values from the caller. I did say: > > Of course, I could be missing something obvious (still haven't > > finished my morning coffee!), so it's worth checking I read the code > > properly.... And that's clearly the case here - I read the code that clamps agbno to min/max backwards.... > Just to clarify.. what exactly is the reverse allocation problem you're > thinking about here? Logically increasing writes resulting in physically > decreasing allocations? If so, have we confirmed that writes are indeed > sequential when this occurs (obvious I guess, but worth asking). Precisely this. And, yes, I've confirmed many times that it is sequential writes that display it. It happens when the closest nearest free extent is below the current target. i.e.: +FFFFFFFFFFFFF+aaaaaaaT+bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb+FFFFF+ We just allocated "a" so the next allocation target is T, but "b" is already allocated there. The freespace before "a" is larger than the free space after "b" (which may be less than maxlen), and so the "nearest best free space" is below the target. If "b" is large enough, we end up with file allocation that looks like (where n = relative file offset for the given allocation) +10+9+8+7+6+5+4+1+2+3+ i.e. allocation starts forward, uses the remaining ascending free space in the extent, then starts allocating backwards as it takes the nearest free space and trims off the top of the free extent for the allocation. > I'm not really familiar with this problem so I've not thought about it, > but one random thought comes to mind: is there any chance of allocation > interlocking behind extent frees contributing to this behavior? We do > truncate files in reverse order an extent at a time and immediately > finish the dfops (to free the extent and busy it) before we unmap the > next piece. Unlikely, because such extents are not immediately available for allocation (they are busy). And I've seen it on pure contended sequential write workloads, too... > If allocations could interlock with such frees (or busy > clearing), I could see in theory how that might result in backwards > allocations, but I'm just handwaving and not sure that's possible in > practice.
Most of the cases I've seen have had the same symptom - "skip to next AG, allocate at same high-up-in AGBNO target as the previous AG wanted, then allocate backwards in the same AG until freespace extent is exhausted. It then skips to some other freespace extent, and depending on whether it's a forwards or backwards skip the problem either goes away or continues. This is not a new behaviour, I first saw it some 15 years ago, but I've never been able to provoke it reliably enough with test code to get to the root cause... Cheers, Dave.
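[Illustrative sketch (not part of the original thread): a toy model of the mechanism Dave describes - with the only nearby free extent below the target, each allocation takes the end of that free extent nearest the target, so successive allocations walk backwards. All numbers are arbitrary.]

    #include <stdio.h>

    int main(void)
    {
            /* one free extent at agbno [1000, 2000) just below the target */
            unsigned long free_start = 1000, free_end = 2000;
            unsigned long target = 2400;    /* end of the last allocation */
            unsigned long len = 100;

            /*
             * Each allocation trims the top off the free extent because
             * that end is nearest the (unattainable) target, so the
             * allocations come out at 1900, 1800, 1700, ... descending.
             */
            while (free_end - free_start >= len) {
                    unsigned long bno = free_end - len;

                    printf("alloc agbno %lu-%lu (target %lu)\n",
                           bno, bno + len - 1, target);
                    free_end = bno;
            }
            return 0;
    }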
On Sun, May 26, 2019 at 08:43:17AM +1000, Dave Chinner wrote: > On Fri, May 24, 2019 at 08:00:18AM -0400, Brian Foster wrote: > > On Fri, May 24, 2019 at 08:15:52AM +1000, Dave Chinner wrote: > > > On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote: > > > > Hmmm.. I suppose if I had a script that > > > > just dumped every applicable stride/delta value for an inode, I could > > > > dump all of those numbers into a file and we can process it from there.. > > > > > > See how the freesp commands work in xfs_db - they just generate a > > > set of {offset, size} tuples that are then bucketted appropriately. > > > This is probably the best way to do this at the moment - have xfs_db > > > walk the inode BMBTs outputting something like {extent size, > > > distance to next extent} tuples and everything else falls out from > > > how we bucket that information. > > > > > > > That sounds plausible. A bit more involved than what I'm currently > > working with, but we do already have a blueprint for the scanning > > implementation required to collect this data via the frag command. > > Perhaps some of this code between the frag/freesp can be generalized and > > reused. I'll take a closer look at it. > > > > My only concern is I'd prefer to only go down this path as long as we > > plan to land the associated command in xfs_db. So this approach suggests > > to me that we add a "locality" command similar to frag/freesp that > > presents the locality state of the fs. For now I'm only really concerned > > with the data associated with known near mode allocations (i.e., such as > > the extent size:distance relationship you've outlined above) so we can > > evaluate these algorithmic changes, but this would be for fs devs only > > so we could always expand on it down the road if we want to assess > > different allocations. Hm? > > Yup, I'm needing to do similar analysis myself to determine how > quickly I'm aging the filesystem, so having the tool in xfs_db or > xfs_spaceman would be very useful. > > FWIW, the tool I've just started writing will just use fallocate and > truncate to hammer the allocation code as hard and as quickly as > possible - I want to do accelerated aging of the filesystem, and so > being able to run tens to hundreds of thousands of free space > manipulations a second is the goal here.... > Ok. FWIW, from playing with this so far (before getting distracted for much of this week) the most straightforward place to add this kind of functionality turns out to be the frag command itself. It does 99% of the work required to process data extents already, including pulling the on-disk records of each inode in-core for processing. I basically just had to update that code to include all of the record data and add the locality tracking logic (I haven't got to actually presenting it yet..). > > > > > (*) Which, in reality, we really should reset because once we jump > > > > > AG we have no locality target and so should allow the full AG to be > > > > > considered. This "didn't reset target" issue is something I suspect > > > > > leads to the infamous "backwards allocation for sequential writes" > > > > > problems... > > > > > > > > I think this is something that's handled at a higher level. In the > > > > nullfb case at least, we use the XFS_ALLOCTYPE_START_BNO allocation mode > > > > which is what allows us to iterate AGs. 
> > > > > > The nullfb case isn't the problem here - I think the problem comes > > > when we fail the THIS_BNO/NEAR_BNO allocation attempts in > > > xfs_bmap_btalloc() and then fall back to XFS_ALLOCTYPE_START_BNO > > > with the same fsbno target. > > > > > > > Ok, that's the retry where we restore minlen from maxlen back to 1. Note > > that still depends on nullfb, but that's just context. I think the > > majority of the time data allocations come into the allocator as the > > first allocation in the transaction. > > > > That aside, I've also seen that this particular retry is hotter than > > you'd think just by glancing at the code due to the whole minlen = > > maxlen thing in xfs_bmap_select_minlen() combined with potential size of > > delalloc extents. See the patches I posted a couple weeks ago in this > > area for example: > > > > https://marc.info/?l=linux-xfs&m=155671950608062&w=2 > > > > ... though I don't think they relate to the issue you're tracking here. > > > > > > We start with a near mode > > > > allocation in the target AG. If that fails, xfs_alloc_vextent() switches > > > > over to XFS_ALLOCTYPE_THIS_AG for the remaining AGs. This lands in > > > > xfs_alloc_ag_vextent_size() which ignores args.agbno even if it's set. > > > > > > Hmmm. I think you just pointed out a bug in this code. The initial > > > NEARBNO alloc sets ->min_agbno and ->max_agbno based on the > > > args->agbno, which xfs_alloc_compute_aligned then uses to determine > > > if the found free extent can be used. When we immediately fall back > > > to XFS_ALLOCTYPE_THIS_AG allocation, we don't reset > > > min_agbno/max_agbno, and so even though we ignore the args->agbno > > > target that is set when selecting a free extent, we still call > > > xfs_alloc_compute_aligned() to determine if it is appropriate to > > > use. > > > > > > > I don't see where the near mode alloc assigns ->[min|max]_agbno. The > > only place I see ->min_agbno used is for sparse inode chunk allocation. > > ->max_agbno is assigned in the near mode alloc when min == 0, but it's > > set to agsize so I'm guessing that this is just an initialization so the > > checking logic works with zeroed out values from the caller. > > I did say: > > > > Of course, I could be missing something obvious (still haven't > > > finished my morning coffee!), so it's worth checking I read the code > > > properly.... > > And that's clearly the case here - I read the code that clamps agbno > to min/max backwards.... > Ok.. > > Just to clarify.. what exactly is the reverse allocation problem you're > > thinking about here? Logically increasing writes resulting in physically > > decreasing allocations? If so, have we confirmed that writes are indeed > > sequential when this occurs (obvious I guess, but worth asking). > > Precisely this. And, yes, I've confirmed many times that it is > sequential writes that display it. It happens when the closest nearest free > extent is below the current target. i.e.: > > > +FFFFFFFFFFFFF+aaaaaaaT+bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb+FFFFF+ > > We just allocated "a" so the next allocation target is T, but "b" is > already allocated there. The freespace before "a" is larger than > the free space after "b" (which may be less than maxlen), and so the > "nearest best free space" is below the target. If "b" is large > enough, we end up with file allocation that looks like (where n = > relative file offset for the given allocation) > > +10+9+8+7+6+5+4+1+2+3+ > > i.e. 
allocation starts forward, uses the remaining ascending free > space in the extent, then starts allocating backwards as it takes > the nearest free space and trims off the top of the free extent for > the allocation. > Hmm, so are you saying that the allocation is not only physically reversed, but the (physically out of order) ascending logical extents in the file are actually physically contiguous as well? IOW, if freed, they would form a single/contiguous free extent? If so, any idea how common that pattern is to the instances of this problem you've seen? I.e., is it always the case, sometimes the case..? Also, any intuition on the size of these extents relative to the file? > > I'm not really familiar with this problem so I've not thought about it, > > but one random thought comes to mind: is there any chance of allocation > > interlocking behind extent frees contributing to this behavior? We do > > truncate files in reverse order an extent at a time and immediately > > finish the dfops (to free the extent and busy it) before we unmap the > > next piece. > > Unlikely, because such extents are not immediately available for > allocation (they are busy). And I've seen it on pure contended > sequential write workloads, too... > I was thinking that unbusying of extents could be similarly interlocked across log buffer completions, but your latter point and the additional context suggests this is more likely influenced by a combination of allocation timing (i.e., AGF locking to determine when we jump AGs) and free space state than anything like this. > > If allocations could interlock with such frees (or busy > > clearing), I could see in theory how that might result in backwards > > allocations, but I'm just handwaving and not sure that's possible in > > practice. > > Most of the cases I've seen have had the same symptom - "skip to > next AG, allocate at same high-up-in AGBNO target as the previous AG > wanted, then allocate backwards in the same AG until freespace > extent is exhausted. It then skips to some other freespace extent, > and depending on whether it's a forwards or backwards skip the > problem either goes away or continues. This is not a new behaviour, > I first saw it some 15 years ago, but I've never been able to > provoke it reliably enough with test code to get to the root > cause... > I guess the biggest question to me is whether we're more looking for a backwards searching pattern or a pattern where we split up a larger free extent into smaller chunks (in reverse), or both. I can definitely see some isolated areas where a backwards search could lead to this behavior. E.g., my previous experiment to replace near mode allocs with size mode allocs always allocates in reverse when free space is sufficiently fragmented. To see this in practice would require repeated size mode allocations, however, which I think is unlikely because once we jump AGs and do a size mode alloc, the subsequent allocs should be near mode within the new AG (unless we jump again and again, which I don't think is consistent with what you're describing). Hmm, the next opportunity for this kind of behavior in the near mode allocator is probably the bnobt left/right span. This would require the right circumstances to hit. We'd have to bypass the first (cntbt) algorithm then find a closer extent in the left mode search vs. the right mode search, and then probably repeat that across however many allocations it takes to work out of this state.
If instead we're badly breaking up an extent in the wrong order, it looks like we do have the capability to allocate the right portion of an extent (in xfs_alloc_compute_diff()) but that is only enabled for non data allocations. xfs_alloc_compute_aligned() can cause a similar effect if alignment is set, but I'm not sure that would break up an extent into more than one usable chunk. In any event, maybe I'll hack some temporary code in the xfs_db locality stuff to quickly flag whether I happen to get lucky enough to reproduce any instances of this during the associated test workloads (and if so, try and collect more data). Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
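[Illustrative sketch (not part of the original thread): the "flag backwards allocations" check Brian proposes - walk a file's extents in logical order and count neighbours in the same AG whose physical start moves backwards. The types are invented stand-ins for real bmbt records.]

    #include <stdio.h>

    struct ext {
            unsigned int agno;
            unsigned long agbno;
            unsigned long len;
    };

    /* count logically-forward, physically-backward neighbours in one AG */
    static unsigned long count_backwards(const struct ext *e, int nextents)
    {
            unsigned long nr = 0;
            int i;

            for (i = 1; i < nextents; i++) {
                    if (e[i].agno != e[i - 1].agno)
                            continue;       /* AG skips tracked elsewhere */
                    if (e[i].agbno < e[i - 1].agbno)
                            nr++;
            }
            return nr;
    }

    int main(void)
    {
            struct ext e[] = {
                    { 0, 500, 10 }, { 0, 400, 10 }, { 0, 300, 10 },
            };

            printf("%lu backwards extents\n", count_backwards(e, 3));
            return 0;
    }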
On Fri, May 31, 2019 at 01:11:36PM -0400, Brian Foster wrote: > On Sun, May 26, 2019 at 08:43:17AM +1000, Dave Chinner wrote: > > On Fri, May 24, 2019 at 08:00:18AM -0400, Brian Foster wrote: > > > On Fri, May 24, 2019 at 08:15:52AM +1000, Dave Chinner wrote: > > > > On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote: > > > > > Hmmm.. I suppose if I had a script that > > > > > just dumped every applicable stride/delta value for an inode, I could > > > > > dump all of those numbers into a file and we can process it from there.. > > > > > > > > See how the freesp commands work in xfs_db - they just generate a > > > > set of {offset, size} tuples that are then bucketted appropriately. > > > > This is probably the best way to do this at the moment - have xfs_db > > > > walk the inode BMBTs outputting something like {extent size, > > > > distance to next extent} tuples and everything else falls out from > > > > how we bucket that information. > > > > > > > > > > That sounds plausible. A bit more involved than what I'm currently > > > working with, but we do already have a blueprint for the scanning > > > implementation required to collect this data via the frag command. > > > Perhaps some of this code between the frag/freesp can be generalized and > > > reused. I'll take a closer look at it. > > > > > > My only concern is I'd prefer to only go down this path as long as we > > > plan to land the associated command in xfs_db. So this approach suggests > > > to me that we add a "locality" command similar to frag/freesp that > > > presents the locality state of the fs. For now I'm only really concerned > > > with the data associated with known near mode allocations (i.e., such as > > > the extent size:distance relationship you've outlined above) so we can > > > evaluate these algorithmic changes, but this would be for fs devs only > > > so we could always expand on it down the road if we want to assess > > > different allocations. Hm? > > > > Yup, I'm needing to do similar analysis myself to determine how > > quickly I'm aging the filesystem, so having the tool in xfs_db or > > xfs_spaceman would be very useful. > > > > FWIW, the tool I've just started writing will just use fallocate and > > truncate to hammer the allocation code as hard and as quickly as > > possible - I want to do accelerated aging of the filesystem, and so > > being able to run tens to hundreds of thousands of free space > > manipulations a second is the goal here.... > > > > Ok. FWIW, from playing with this so far (before getting distracted for > much of this week) the most straightforward place to add this kind of > functionality turns out to be the frag command itself. It does 99% of > the work required to process data extents already, including pulling the > on-disk records of each inode in-core for processing. I basically just > had to update that code to include all of the record data and add the > locality tracking logic (I haven't got to actually presenting it yet..). > I managed to collect some preliminary data based on this strategy. I regenerated the associated dataset as I wanted to introduce more near mode allocation events where locality is relevant/measurable. The set is still generated via filebench, but the workload now runs 8x file creators in parallel with 16x random size file appenders (with 1% of the dataset being preallocated to support the appends, and thus without contention). In total, it creates 10k files that amount to ~5.8TB of space.
The filesystem geometry is as follows:

meta-data=/dev/mapper/fedora_hpe--apollo--cn99xx--19-tmp isize=512 agcount=8, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=1941012480, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The locality data is collected via a hacked up variant of 'xfs_db -c frag ...' that reuses the 'freesp' command histogram to display locality deltas instead of extent lengths. Each bucket shows how many extents/allocs fall into that particular delta range and the average length of the associated extents. Note that the percent column is based on extent count vs the total (not block or delta counts). Finally, note that the final row uses a dummy value of agsize to account for AG jumps. As such, the associated delta values are invalid, but the extent count/size values are valid. This data is included to provide some insight on how often we fall back to external AGs (from the locality target) due to contention. Bear in mind that so far I've only run this workload once on each kernel and I believe there is opportunity for variance run-to-run. Current data and observations follow:

- Baseline kernel (5.2.0-rc1):

Files on this filesystem average 59.81 extents per file

      from         to  extents           delta   avgsz     pct
         0          0        1               0    4879    0.00
         1          1       18              18     633    0.00
         2          3       76             193     630    0.01
         4          7      117             644   10801    0.02
         8         15      246            2735     701    0.04
        16         31      411            9769     873    0.07
        32         63      858           40614     877    0.14
        64        127     1931          183122     872    0.32
       128        255     3658          693079    1423    0.60
       256        511     4393         1619094     767    0.73
       512       1023     6049         4582491     666    1.00
      1024       2047    19564        34684608     810    3.23
      2048       4095    21391        62471744     828    3.53
      4096       8191    24735       140424459     920    4.08
      8192      16383    18383       213465918    1030    3.03
     16384      32767    14447       336272956    1195    2.38
     32768      65535    12359       580683797    1154    2.04
     65536     131071    11943      1138606730    1446    1.97
    131072     262143    16825      3279118006    1701    2.78
    262144     524287    32133     12725299194    1905    5.30
    524288    1048575    58899     45775066024    1845    9.72
   1048576    2097151    95302    147197567651    2020   15.73
   2097152    4194303    86659    252037848233    2794   14.30
   4194304    8388607    67067    397513880288    2876   11.07
   8388608   16777215    47940    583161227489    2319    7.91
  16777216   33554431    24878    537577890034    3321    4.11
  33554432   67108863     5889    269065981311   16106    0.97
  67108864  134217727     3022    304867554478   33429    0.50
 134217728  268435454     2994    539180795612   35744    0.49
 268435455  268435455    23709   6364336202595    4840    3.91

- Control (base w/ unconditional SIZE mode allocs):

Files on this filesystem average 58.60 extents per file

      from         to  extents           delta   avgsz     pct
         0          0        1               0     180    0.00
         4          7       19             115   11379    0.00
         8         15       21             231      15    0.00
        16         31        3              58      45    0.00
        64        127        3             288    7124    0.00
       128        255        4             780   60296    0.00
       256        511        3             978   51563    0.00
       512       1023        9            7072  105727    0.00
      1024       2047       33           50773    4765    0.01
      2048       4095       98          306258   15689    0.02
      4096       8191      258         1513775    1981    0.04
      8192      16383      458         5633481    2537    0.08
     16384      32767      934        23078682    3013    0.16
     32768      65535     1783        87851701    3109    0.30
     65536     131071     3382       332685185    1810    0.57
    131072     262143     8904      1784659842    2275    1.50
    262144     524287    23878      9433551033    1903    4.02
    524288    1048575    54422     42483032893    1894    9.17
   1048576    2097151    97884    148883431239    2048   16.49
   2097152    4194303    81999    236737184381    2741   13.81
   4194304    8388607    86826    510450130696    2639   14.63
   8388608   16777215    54870    652250378434    2101    9.24
  16777216   33554431    40408    985568011752    1959    6.81
  33554432   67108863    46354   2258464180697    2538    7.81
  67108864  134217727    59461   5705095317989    3380   10.02
 134217728  268435454    16205   2676447855794    4891    2.73
 268435455  268435455    15423   4140080022465    5243    2.60

- Test (base + this series):

Files on this filesystem average 59.76 extents per file

      from         to  extents           delta   avgsz     pct
         0          0        2               0     419    0.00
         1          1      258             258     387    0.04
         2          3       81             201     598    0.01
         4          7      139             790   13824    0.02
         8         15      257            2795     710    0.04
        16         31      417            9790     852    0.07
        32         63      643           30714     901    0.11
        64        127     1158          110148     835    0.19
       128        255     1947          370953     822    0.32
       256        511     3567         1348313     744    0.59
       512       1023     5151         3921794     695    0.85
      1024       2047    22895        39640251     924    3.78
      2048       4095    34375       100757727     922    5.68
      4096       8191    30381       171773183     914    5.02
      8192      16383    18977       214896159    1091    3.13
     16384      32767     8460       192726268    1212    1.40
     32768      65535     6071       286544623    1611    1.00
     65536     131071     7803       757765738    1680    1.29
    131072     262143    15300      3001300980    1877    2.53
    262144     524287    27218     10868169139    1993    4.50
    524288    1048575    60423     47321126020    1948    9.98
   1048576    2097151   100683    158191884842    2174   16.63
   2097152    4194303    92642    274106200889    2508   15.30
   4194304    8388607    73987    436219940202    2421   12.22
   8388608   16777215    49636    591854981117    2434    8.20
  16777216   33554431    15716    353157130237    4772    2.60
  33554432   67108863     4948    228004142686   19221    0.82
  67108864  134217727     2381    231811967738   35781    0.39
 134217728  268435454     2140    385403697868   29221    0.35
 268435455  268435455    17746   4763655584430    7603    2.93

Firstly, comparison of the baseline and control data shows that near mode allocation is effective at improving allocation locality compared to size mode. In both cases, the 1048576-4194304 buckets hold the majority of extents. If we look at the sub-1048576 data, however, ~40% of allocations land in this range on the baseline kernel vs. only ~16% for the control. Another interesting data point is the noticeable drop in AG skips (~24k) from the baseline kernel to the control (~15k). I suspect this is due to the additional overhead of locality based allocation causing more contention.

Comparison of the baseline and test data shows a generally similar breakdown between the two. The sub-1048576 range populates the same buckets and makes up ~41% of the total extents. The per-bucket numbers differ, but all of the buckets are within a percentage point or so. One previously unknown advantage of the test algorithm shown by this data is that the number of AG skips drops down to ~18k, which almost splits the difference between the baseline and control (and slightly in favor of the latter). I suspect that is related to the more simple and bounded near mode algorithm as compared to the current one.

Thoughts on any of this data or presentation? I could dig further into details or alternatively base the histogram on something like extent size and show the average delta for each extent size bucket, but I'm not sure that will tell us anything profound with respect to this patchset.

One thing I noticed while processing this data is that the current dataset skews heavily towards smaller allocations. I still think it's a useful comparison because smaller allocations are more likely to stress either algorithm via a larger locality search space, but I may try to repeat this test with a workload with fewer files and larger allocations and see how that changes things.

Brian
On Fri, May 31, 2019 at 01:11:36PM -0400, Brian Foster wrote:
> On Sun, May 26, 2019 at 08:43:17AM +1000, Dave Chinner wrote:
> > Most of the cases I've seen have had the same symptom - "skip to
> > next AG, allocate at same high-up-in AGBNO target as the previous AG
> > wanted, then allocate backwards in the same AG until freespace
> > extent is exhausted. It then skips to some other freespace extent,
> > and depending on whether it's a forwards or backwards skip the
> > problem either goes away or continues. This is not a new behaviour,
> > I first saw it some 15 years ago, but I've never been able to
> > provoke it reliably enough with test code to get to the root
> > cause...
> >
>
> I guess the biggest question to me is whether we're more looking for a
> backwards searching pattern or a pattern where we split up a larger
> free extent into smaller chunks (in reverse), or both. I can
> definitely see some isolated areas where a backwards search could
> lead to this behavior. E.g., my previous experiment to replace near
> mode allocs with size mode allocs always allocates in reverse when
> free space is sufficiently fragmented. To see this in practice would
> require repeated size mode allocations, however, which I think is
> unlikely because once we jump AGs and do a size mode alloc, the
> subsequent allocs should be near mode within the new AG (unless we
> jump again and again, which I don't think is consistent with what
> you're describing).
>
> Hmm, the next opportunity for this kind of behavior in the near mode
> allocator is probably the bnobt left/right span. This would require
> the right circumstances to hit. We'd have to bypass the first (cntbt)
> algorithm then find a closer extent in the left mode search vs. the
> right mode search, and then probably repeat that across however many
> allocations it takes to work out of this state.
>
> If instead we're badly breaking up an extent in the wrong order, it
> looks like we do have the capability to allocate the right portion of
> an extent (in xfs_alloc_compute_diff()) but that is only enabled for
> non data allocations. xfs_alloc_compute_aligned() can cause a similar
> effect if alignment is set, but I'm not sure that would break up an
> extent into more than one usable chunk.

This pretty much matches what I've been able to infer about the cause,
but lacking a way to actually trigger it and be able to monitor the
behaviour in real time is where I've got stuck on this. I see the
result in aged, fragmented filesystems and can infer how it may have
occurred, but can't cause it to occur on demand...

> In any event, maybe I'll hack some temporary code in the xfs_db
> locality stuff to quickly flag whether I happen to get lucky enough
> to reproduce any instances of this during the associated test
> workloads (and if so, try and collect more data).

*nod* Best we can do, I think, and hope we stumble across an easily
reproducible trigger...

Cheers,

Dave.
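To make the "allocate backwards" symptom concrete, here is a toy model
(illustrative only, not XFS allocator code): when the locality target
lies beyond a large free extent and each allocation takes the
candidate block closest to the target, the extent is carved from its
right edge, so successive allocations return strictly descending start
blocks:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t	free_start = 1000;	/* free extent [1000, 2000) */
	uint64_t	free_len = 1000;
	uint64_t	target = 5000;		/* wanted AGBNO, beyond extent */
	uint64_t	want = 100;		/* allocation length */

	while (free_len >= want) {
		/* the candidate closest to a target on our right */
		uint64_t	bno = free_start + free_len - want;

		printf("alloc bno %llu len %llu (diff %llu)\n",
		       (unsigned long long)bno,
		       (unsigned long long)want,
		       (unsigned long long)(target - bno));
		free_len -= want;	/* extent is eaten from the right */
	}
	return 0;
}

Run, this prints allocations at 1900, 1800, ... down to 1000: each
allocation lands further from the target than the last, and the file
data ends up laid out in reverse of its logical order.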
On Thu, Jun 06, 2019 at 11:21:04AM -0400, Brian Foster wrote:
> On Fri, May 31, 2019 at 01:11:36PM -0400, Brian Foster wrote:
> > On Sun, May 26, 2019 at 08:43:17AM +1000, Dave Chinner wrote:
> > > On Fri, May 24, 2019 at 08:00:18AM -0400, Brian Foster wrote:
> > > > On Fri, May 24, 2019 at 08:15:52AM +1000, Dave Chinner wrote:
> > > > > On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote:
> > > > > > Hmmm.. I suppose if I had a script that just dumped every
> > > > > > applicable stride/delta value for an inode, I could dump all
> > > > > > of those numbers into a file and we can process it from
> > > > > > there..
> > > > >
> > > > > See how the freesp commands work in xfs_db - they just
> > > > > generate a set of {offset, size} tuples that are then bucketed
> > > > > appropriately. This is probably the best way to do this at the
> > > > > moment - have xfs_db walk the inode BMBTs outputting something
> > > > > like {extent size, distance to next extent} tuples and
> > > > > everything else falls out from how we bucket that information.
> > > >
> > > > That sounds plausible. A bit more involved than what I'm
> > > > currently working with, but we do already have a blueprint for
> > > > the scanning implementation required to collect this data via
> > > > the frag command. Perhaps some of this code between the
> > > > frag/freesp can be generalized and reused. I'll take a closer
> > > > look at it.
> > > >
> > > > My only concern is I'd prefer to only go down this path as long
> > > > as we plan to land the associated command in xfs_db. So this
> > > > approach suggests to me that we add a "locality" command similar
> > > > to frag/freesp that presents the locality state of the fs. For
> > > > now I'm only really concerned with the data associated with
> > > > known near mode allocations (i.e., such as the extent
> > > > size:distance relationship you've outlined above) so we can
> > > > evaluate these algorithmic changes, but this would be for fs
> > > > devs only so we could always expand on it down the road if we
> > > > want to assess different allocations. Hm?
> > >
> > > Yup, I'm needing to do similar analysis myself to determine how
> > > quickly I'm aging the filesystem, so having the tool in xfs_db or
> > > xfs_spaceman would be very useful.
> > >
> > > FWIW, the tool I've just started writing will just use fallocate
> > > and truncate to hammer the allocation code as hard and as quickly
> > > as possible - I want to do accelerated aging of the filesystem,
> > > and so being able to run tens to hundreds of thousands of free
> > > space manipulations a second is the goal here....
> > >
> >
> > Ok. FWIW, from playing with this so far (before getting distracted
> > for much of this week) the most straightforward place to add this
> > kind of functionality turns out to be the frag command itself. It
> > does 99% of the work required to process data extents already,
> > including pulling the on-disk records of each inode in-core for
> > processing. I basically just had to update that code to include all
> > of the record data and add the locality tracking logic (I haven't
> > got to actually presenting it yet..).
> >
>
> I managed to collect some preliminary data based on this strategy. I
....
>
> Comparison of the baseline and test data shows a generally similar
> breakdown between the two.

Which is the result I wanted to see :)

> Thoughts on any of this data or presentation?

I think it's useful for comparing whether an allocator change has
affected the overall locality of allocation. If it's working as we
expect, you should get vastly different results for inode32 vs inode64
mount options, with inode32 showing much, much higher distances for
most allocations, so it might be worth running a quick test to confirm
that it does, indeed, demonstrate the results we'd expect from such a
change.

> I could dig further into details or alternatively base the histogram
> on something like extent size and show the average delta for each
> extent size bucket, but I'm not sure that will tell us anything
> profound with respect to this patchset.

*nod*

> One thing I noticed while processing this data is that the current
> dataset skews heavily towards smaller allocations. I still think it's
> a useful comparison because smaller allocations are more likely to
> stress either algorithm via a larger locality search space, but I may
> try to repeat this test with a workload with fewer files and larger
> allocations and see how that changes things.

From the testing I've been doing, I think the file count of around 10k
isn't sufficient to really cause severe allocation issues. Directory
and inode metadata are great for fragmenting free space, so
dramatically increasing the number of smaller files might actually
produce worse behaviour....

Cheers,

Dave.
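For reference, a minimal sketch of the sort of fallocate/truncate
"aging hammer" Dave describes might look like the following; the file
names, counts and size cap here are made up for illustration:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define NFILES	64
#define NITERS	100000
#define MAXLEN	(16 << 20)	/* cap preallocations at 16MB */

int
main(int argc, char **argv)
{
	const char	*dir = argc > 1 ? argv[1] : ".";
	int		fd[NFILES];
	char		path[256];
	int		i;
	long		n;

	for (i = 0; i < NFILES; i++) {
		snprintf(path, sizeof(path), "%s/age.%d", dir, i);
		fd[i] = open(path, O_CREAT | O_RDWR, 0644);
		if (fd[i] < 0) {
			perror("open");
			return 1;
		}
	}

	srandom(0);		/* deterministic, for repeatable runs */
	for (n = 0; n < NITERS; n++) {
		i = random() % NFILES;

		/* alternate between allocating and freeing space */
		if (random() & 1) {
			off_t	len = (random() % MAXLEN) + 4096;

			if (fallocate(fd[i], 0, 0, len) < 0)
				perror("fallocate");
		} else if (ftruncate(fd[i], 0) < 0) {
			perror("ftruncate");
		}
	}
	return 0;
}

The point is simply to drive large numbers of allocate/free cycles per
second through the allocator without any data IO; a real tool would
presumably randomize offsets and lengths more aggressively.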
On Fri, Jun 07, 2019 at 08:05:59AM +1000, Dave Chinner wrote:
> On Fri, May 31, 2019 at 01:11:36PM -0400, Brian Foster wrote:
> > On Sun, May 26, 2019 at 08:43:17AM +1000, Dave Chinner wrote:
> > > Most of the cases I've seen have had the same symptom - "skip to
> > > next AG, allocate at same high-up-in AGBNO target as the previous
> > > AG wanted, then allocate backwards in the same AG until freespace
> > > extent is exhausted. It then skips to some other freespace
> > > extent, and depending on whether it's a forwards or backwards
> > > skip the problem either goes away or continues. This is not a new
> > > behaviour, I first saw it some 15 years ago, but I've never been
> > > able to provoke it reliably enough with test code to get to the
> > > root cause...
> > >
> >
> > I guess the biggest question to me is whether we're more looking
> > for a backwards searching pattern or a pattern where we split up a
> > larger free extent into smaller chunks (in reverse), or both. I can
> > definitely see some isolated areas where a backwards search could
> > lead to this behavior. E.g., my previous experiment to replace near
> > mode allocs with size mode allocs always allocates in reverse when
> > free space is sufficiently fragmented. To see this in practice
> > would require repeated size mode allocations, however, which I
> > think is unlikely because once we jump AGs and do a size mode
> > alloc, the subsequent allocs should be near mode within the new AG
> > (unless we jump again and again, which I don't think is consistent
> > with what you're describing).
> >
> > Hmm, the next opportunity for this kind of behavior in the near
> > mode allocator is probably the bnobt left/right span. This would
> > require the right circumstances to hit. We'd have to bypass the
> > first (cntbt) algorithm then find a closer extent in the left mode
> > search vs. the right mode search, and then probably repeat that
> > across however many allocations it takes to work out of this state.
> >
> > If instead we're badly breaking up an extent in the wrong order, it
> > looks like we do have the capability to allocate the right portion
> > of an extent (in xfs_alloc_compute_diff()) but that is only enabled
> > for non data allocations. xfs_alloc_compute_aligned() can cause a
> > similar effect if alignment is set, but I'm not sure that would
> > break up an extent into more than one usable chunk.
>
> This pretty much matches what I've been able to infer about the
> cause, but lacking a way to actually trigger it and be able to
> monitor the behaviour in real time is where I've got stuck on this.
> I see the result in aged, fragmented filesystems and can infer how it
> may have occurred, but can't cause it to occur on demand...
>

Ok.

> > In any event, maybe I'll hack some temporary code in the xfs_db
> > locality stuff to quickly flag whether I happen to get lucky enough
> > to reproduce any instances of this during the associated test
> > workloads (and if so, try and collect more data).
>
> *nod*
>
> Best we can do, I think, and hope we stumble across an easily
> reproducible trigger...
>

Unfortunately I haven't seen any instances of this in the workloads I
ran to generate the most recent datasets. I have a couple more
experiments to try though so I'll keep an eye out.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
On Fri, Jun 07, 2019 at 08:13:01AM +1000, Dave Chinner wrote:
> On Thu, Jun 06, 2019 at 11:21:04AM -0400, Brian Foster wrote:
> > On Fri, May 31, 2019 at 01:11:36PM -0400, Brian Foster wrote:
> > > On Sun, May 26, 2019 at 08:43:17AM +1000, Dave Chinner wrote:
> > > > On Fri, May 24, 2019 at 08:00:18AM -0400, Brian Foster wrote:
> > > > > On Fri, May 24, 2019 at 08:15:52AM +1000, Dave Chinner wrote:
> > > > > > On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote:
> > > > > > > Hmmm.. I suppose if I had a script that just dumped every
> > > > > > > applicable stride/delta value for an inode, I could dump
> > > > > > > all of those numbers into a file and we can process it
> > > > > > > from there..
> > > > > >
> > > > > > See how the freesp commands work in xfs_db - they just
> > > > > > generate a set of {offset, size} tuples that are then
> > > > > > bucketed appropriately. This is probably the best way to do
> > > > > > this at the moment - have xfs_db walk the inode BMBTs
> > > > > > outputting something like {extent size, distance to next
> > > > > > extent} tuples and everything else falls out from how we
> > > > > > bucket that information.
> > > > >
> > > > > That sounds plausible. A bit more involved than what I'm
> > > > > currently working with, but we do already have a blueprint
> > > > > for the scanning implementation required to collect this data
> > > > > via the frag command. Perhaps some of this code between the
> > > > > frag/freesp can be generalized and reused. I'll take a closer
> > > > > look at it.
> > > > >
> > > > > My only concern is I'd prefer to only go down this path as
> > > > > long as we plan to land the associated command in xfs_db. So
> > > > > this approach suggests to me that we add a "locality" command
> > > > > similar to frag/freesp that presents the locality state of
> > > > > the fs. For now I'm only really concerned with the data
> > > > > associated with known near mode allocations (i.e., such as
> > > > > the extent size:distance relationship you've outlined above)
> > > > > so we can evaluate these algorithmic changes, but this would
> > > > > be for fs devs only so we could always expand on it down the
> > > > > road if we want to assess different allocations. Hm?
> > > >
> > > > Yup, I'm needing to do similar analysis myself to determine how
> > > > quickly I'm aging the filesystem, so having the tool in xfs_db
> > > > or xfs_spaceman would be very useful.
> > > >
> > > > FWIW, the tool I've just started writing will just use
> > > > fallocate and truncate to hammer the allocation code as hard
> > > > and as quickly as possible - I want to do accelerated aging of
> > > > the filesystem, and so being able to run tens to hundreds of
> > > > thousands of free space manipulations a second is the goal
> > > > here....
> > > >
> > >
> > > Ok. FWIW, from playing with this so far (before getting
> > > distracted for much of this week) the most straightforward place
> > > to add this kind of functionality turns out to be the frag
> > > command itself. It does 99% of the work required to process data
> > > extents already, including pulling the on-disk records of each
> > > inode in-core for processing. I basically just had to update that
> > > code to include all of the record data and add the locality
> > > tracking logic (I haven't got to actually presenting it yet..).
> > >
> >
> > I managed to collect some preliminary data based on this strategy. I
> ....
> >
> > Comparison of the baseline and test data shows a generally similar
> > breakdown between the two.
>
> Which is the result I wanted to see :)
>

Indeed. ;)

> > Thoughts on any of this data or presentation?
>
> I think it's useful for comparing whether an allocator change has
> affected the overall locality of allocation. If it's working as we
> expect, you should get vastly different results for inode32 vs
> inode64 mount options, with inode32 showing much, much higher
> distances for most allocations, so it might be worth running a quick
> test to confirm that it does, indeed, demonstrate the results we'd
> expect from such a change.
>

Ok, I can add inode32 into the mix as a sanity check. I'm guessing
this will result in an increase in AG skips. I also think that once we
have that first data allocation, the existing heuristics may apply
similarly since we're only concerned about contiguity with the
previous extent at that point. Perhaps I'll reserve this for a test
with an even higher inode count so there are more first data extent
allocations (where the inode is the locality target, but sits in a
metadata preferred AG) to measure.

> > I could dig further into details or alternatively base the
> > histogram on something like extent size and show the average delta
> > for each extent size bucket, but I'm not sure that will tell us
> > anything profound with respect to this patchset.
>
> *nod*
>
> > One thing I noticed while processing this data is that the current
> > dataset skews heavily towards smaller allocations. I still think
> > it's a useful comparison because smaller allocations are more
> > likely to stress either algorithm via a larger locality search
> > space, but I may try to repeat this test with a workload with fewer
> > files and larger allocations and see how that changes things.
>
> From the testing I've been doing, I think the file count of around
> 10k isn't sufficient to really cause severe allocation issues.
> Directory and inode metadata are great for fragmenting free space, so
> dramatically increasing the number of smaller files might actually
> produce worse behaviour....
>

I'd still like to see the results for larger allocations, if for
nothing else than the extra coverage I was hoping for from the most
recent test. Once I get through that, I'll move back in the smaller
file direction and increase the file count as well as the span of the
directory tree. Hmm, random file sizes between 4k and say 96k or so
should allow me to at least create tens of millions of files/dirs (I'd
like to get above 64k for some of these allocations to cover post-eof
prealloc). That should stress the allocator as noted with the
additional inode/dir metadata allocations as well as facilitate
inode32 experiments.

Thanks for the feedback..

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
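A sketch of the small-file workload Brian outlines above; all
parameters (file count, directory spread, buffer contents) are
illustrative, and a real run would push the file count much higher:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define NDIRS	256
#define NFILES	1000000		/* scale towards tens of millions */
#define MINSZ	4096
#define MAXSZ	(96 * 1024)

int
main(int argc, char **argv)
{
	const char	*root = argc > 1 ? argv[1] : ".";
	static char	buf[MAXSZ];
	char		path[512];
	long		i;

	memset(buf, 0xab, sizeof(buf));
	srandom(0);
	for (i = 0; i < NFILES; i++) {
		int	fd;
		size_t	len;

		/* spread files over a set of subdirectories */
		snprintf(path, sizeof(path), "%s/d%03ld", root, i % NDIRS);
		mkdir(path, 0755);	/* EEXIST after the first pass */
		snprintf(path, sizeof(path), "%s/d%03ld/f%ld",
			 root, i % NDIRS, i);

		fd = open(path, O_CREAT | O_WRONLY, 0644);
		if (fd < 0)
			continue;
		/*
		 * Random size in [4k, 96k]; sizes above 64k also
		 * exercise post-eof preallocation as noted above.
		 */
		len = MINSZ + (random() % (MAXSZ - MINSZ + 1));
		if (write(fd, buf, len) < 0)
			perror("write");
		close(fd);
	}
	return 0;
}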
On Wed, May 22, 2019 at 02:05:35PM -0400, Brian Foster wrote:
> Hi all,
>
> This is v2 of the extent allocation rework series. The changes in
> this version are mostly associated with code factoring, based on
> feedback to v1. The small mode helper refactoring has been isolated
> and pulled to the start of the series. The active flag that
> necessitated the btree cursor container structure has been pushed
> into the xfs_btree_cur private area. The resulting high level
> allocation code in xfs_ag_alloc_vextent() has been cleaned up to
> remove an unnecessary level of abstraction. Finally, there are
> various minor cleanups and fixes.

Hmmm. Just for fun I decided to apply this series to see what would
happen, and on a 1k block filesystem shook out a test failure that
seems like it could be related?

MKFS_OPTIONS='-f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdf'
MOUNT_OPTIONS='-o usrquota,grpquota,prjquota /dev/sdf /opt'

--D

--- generic/223.out
+++ generic/223.out.bad
@@ -48,7 +48,7 @@
 === Testing size 1g falloc on 8k stripe ===
 SCRATCH_MNT/file-1g-falloc: well-aligned
 === Testing size 1073745920 falloc on 8k stripe ===
-SCRATCH_MNT/file-1073745920-falloc: well-aligned
+SCRATCH_MNT/file-1073745920-falloc: Start block 197673 not multiple of sunit 2
 === mkfs with su 4 blocks x 4 ===
 === Testing size 1*16k on 16k stripe ===
 SCRATCH_MNT/file-1-16384-falloc: well-aligned
@@ -98,7 +98,7 @@
 === Testing size 1g falloc on 16k stripe ===
 SCRATCH_MNT/file-1g-falloc: well-aligned
 === Testing size 1073745920 falloc on 16k stripe ===
-SCRATCH_MNT/file-1073745920-falloc: well-aligned
+SCRATCH_MNT/file-1073745920-falloc: Start block 197673 not multiple of sunit 4
 === mkfs with su 8 blocks x 4 ===
 === Testing size 1*32k on 32k stripe ===
 SCRATCH_MNT/file-1-32768-falloc: well-aligned
@@ -148,7 +148,7 @@
 === Testing size 1g falloc on 32k stripe ===
 SCRATCH_MNT/file-1g-falloc: well-aligned
 === Testing size 1073745920 falloc on 32k stripe ===
-SCRATCH_MNT/file-1073745920-falloc: well-aligned
+SCRATCH_MNT/file-1073745920-falloc: Start block 197673 not multiple of sunit 8
 === mkfs with su 16 blocks x 4 ===
 === Testing size 1*64k on 64k stripe ===
 SCRATCH_MNT/file-1-65536-falloc: well-aligned
@@ -198,7 +198,7 @@
 === Testing size 1g falloc on 64k stripe ===
 SCRATCH_MNT/file-1g-falloc: well-aligned
 === Testing size 1073745920 falloc on 64k stripe ===
-SCRATCH_MNT/file-1073745920-falloc: well-aligned
+SCRATCH_MNT/file-1073745920-falloc: Start block 197665 not multiple of sunit 16
 === mkfs with su 32 blocks x 4 ===
 === Testing size 1*128k on 128k stripe ===
 SCRATCH_MNT/file-1-131072-falloc: well-aligned
On Fri, Jun 21, 2019 at 08:18:35AM -0700, Darrick J. Wong wrote:
> On Wed, May 22, 2019 at 02:05:35PM -0400, Brian Foster wrote:
> > Hi all,
> >
> > This is v2 of the extent allocation rework series. The changes in
> > this version are mostly associated with code factoring, based on
> > feedback to v1. The small mode helper refactoring has been isolated
> > and pulled to the start of the series. The active flag that
> > necessitated the btree cursor container structure has been pushed
> > into the xfs_btree_cur private area. The resulting high level
> > allocation code in xfs_ag_alloc_vextent() has been cleaned up to
> > remove an unnecessary level of abstraction. Finally, there are
> > various minor cleanups and fixes.
>
> Hmmm. Just for fun I decided to apply this series to see what would
> happen, and on a 1k block filesystem shook out a test failure that
> seems like it could be related?
>

I had reproduced this earlier on and eventually determined it was kind
of circumstantial with respect to this series. I had eliminated some
of the operations from generic/223 for quicker reproduction/RCA and
ultimately reproduced the same behavior on upstream (at the time,
which was probably a month or two ago by now) with a slightly
different workload. That led to the following series:

https://marc.info/?l=linux-xfs&m=155671950608062&w=2

... which IIRC addressed the problem in both scenarios. Thoughts on
those patches?

Brian

> MKFS_OPTIONS='-f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdf'
> MOUNT_OPTIONS='-o usrquota,grpquota,prjquota /dev/sdf /opt'
>
> --D
>
> --- generic/223.out
> +++ generic/223.out.bad
> @@ -48,7 +48,7 @@
>  === Testing size 1g falloc on 8k stripe ===
>  SCRATCH_MNT/file-1g-falloc: well-aligned
>  === Testing size 1073745920 falloc on 8k stripe ===
> -SCRATCH_MNT/file-1073745920-falloc: well-aligned
> +SCRATCH_MNT/file-1073745920-falloc: Start block 197673 not multiple of sunit 2
>  === mkfs with su 4 blocks x 4 ===
>  === Testing size 1*16k on 16k stripe ===
>  SCRATCH_MNT/file-1-16384-falloc: well-aligned
> @@ -98,7 +98,7 @@
>  === Testing size 1g falloc on 16k stripe ===
>  SCRATCH_MNT/file-1g-falloc: well-aligned
>  === Testing size 1073745920 falloc on 16k stripe ===
> -SCRATCH_MNT/file-1073745920-falloc: well-aligned
> +SCRATCH_MNT/file-1073745920-falloc: Start block 197673 not multiple of sunit 4
>  === mkfs with su 8 blocks x 4 ===
>  === Testing size 1*32k on 32k stripe ===
>  SCRATCH_MNT/file-1-32768-falloc: well-aligned
> @@ -148,7 +148,7 @@
>  === Testing size 1g falloc on 32k stripe ===
>  SCRATCH_MNT/file-1g-falloc: well-aligned
>  === Testing size 1073745920 falloc on 32k stripe ===
> -SCRATCH_MNT/file-1073745920-falloc: well-aligned
> +SCRATCH_MNT/file-1073745920-falloc: Start block 197673 not multiple of sunit 8
>  === mkfs with su 16 blocks x 4 ===
>  === Testing size 1*64k on 64k stripe ===
>  SCRATCH_MNT/file-1-65536-falloc: well-aligned
> @@ -198,7 +198,7 @@
>  === Testing size 1g falloc on 64k stripe ===
>  SCRATCH_MNT/file-1g-falloc: well-aligned
>  === Testing size 1073745920 falloc on 64k stripe ===
> -SCRATCH_MNT/file-1073745920-falloc: well-aligned
> +SCRATCH_MNT/file-1073745920-falloc: Start block 197665 not multiple of sunit 16
>  === mkfs with su 32 blocks x 4 ===
>  === Testing size 1*128k on 128k stripe ===
>  SCRATCH_MNT/file-1-131072-falloc: well-aligned
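For context, the generic/223 failure above is an extent alignment
check: the test verifies that each file extent begins on a stripe unit
boundary. A standalone approximation of that check using FIEMAP could
look like the following (the test itself drives xfs_bmap and its units
may differ; this is only an illustration, and the limits here are
assumptions):

#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define MAX_EXTENTS	32

int
main(int argc, char **argv)
{
	struct fiemap		*fm;
	unsigned long		sunit, bsize;
	unsigned int		i;
	int			fd;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <file> <sunit blocks> <blksize>\n",
			argv[0]);
		return 1;
	}
	sunit = strtoul(argv[2], NULL, 0);
	bsize = strtoul(argv[3], NULL, 0);

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	fm = calloc(1, sizeof(*fm) +
		       MAX_EXTENTS * sizeof(struct fiemap_extent));
	fm->fm_length = ~0ULL;			/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
	fm->fm_extent_count = MAX_EXTENTS;
	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("fiemap");
		return 1;
	}

	for (i = 0; i < fm->fm_mapped_extents; i++) {
		unsigned long long	startblock =
			fm->fm_extents[i].fe_physical / bsize;

		if (startblock % sunit)
			printf("Start block %llu not multiple of sunit %lu\n",
			       startblock, sunit);
	}
	return 0;
}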