[v2,00/11] xfs: rework extent allocation

Message ID 20190522180546.17063-1-bfoster@redhat.com

Brian Foster May 22, 2019, 6:05 p.m. UTC
Hi all,

This is v2 of the extent allocation rework series. The changes in this
version are mostly associated with code factoring, based on feedback to
v1. The small mode helper refactoring has been isolated and pulled to
the start of the series. The active flag that necessitated the btree
cursor container structure has been pushed into the xfs_btree_cur
private area. The resulting high level allocation code in
xfs_ag_alloc_vextent() has been cleaned up to remove an unnecessary
level of abstraction. Finally, there are various minor cleanups and
fixes.

On the testing front, I've run a couple more filebench oriented tests
since v1. The first is a high load, large filesystem, parallel file
write+fsync test to try and determine whether the modified near mode
allocation algorithm resulted in larger latencies in the common
(non-fragmented) case. The results show comparable latencies, though the
updated algorithm has a slightly faster overall runtime for whatever
reason.

The second is another filebench test (but with a smaller fileset against
a smaller filesystem), but with the purpose of measuring "locality
effectiveness" of the updated algorithm via post-test analysis of the
resulting/populated filesystem. I've been thinking a bit about how to
test for locality since starting on this series and ultimately came up
with the following, fairly crude heuristic: track and compare the worst
locality allocation for each regular file inode in the fs. This
essentially locates the most distant extent for each inode, tracks the
delta from that extent to the inode location on disk and calculates the
average worst case delta across the entire set of regular files. The
results show that the updated algorithm provides a comparable level of
locality to the existing algorithm. Note that this test also involves a
control run using a kernel modified to replace all near mode allocations
with by-size allocations. This is to show the effectiveness of near mode
allocation in general as compared to essentially random allocations.
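
For reference, the heuristic boils down to something like the sketch
below. This is illustrative C only (made up types, not the actual test
harness); the real analysis walks each inode's extent list, and the
reported mean is taken over these per-file worst case deltas:

/*
 * Sketch of the "worst locality" heuristic: find the extent farthest
 * from the inode and record that delta (in 512b sectors).
 */
#include <stdint.h>

struct ext {
	uint64_t start;		/* extent start, in 512b sectors */
};

static uint64_t worst_delta(uint64_t ino_sector, const struct ext *exts,
			    int nexts)
{
	uint64_t worst = 0;

	for (int i = 0; i < nexts; i++) {
		uint64_t d = exts[i].start > ino_sector ?
			     exts[i].start - ino_sector :
			     ino_sector - exts[i].start;
		if (d > worst)
			worst = d;
	}
	return worst;
}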

Further details and results from both of these tests are appended below
the changelog. Thoughts, reviews, flames appreciated.

Brian

v2:
- Lift small mode refactoring into separate patch (retained review
  tag(s)).
- Various logic cleanups and refactors.
- Push active flag down into btree cursor private area; eliminate cursor
  container struct.
- Refactor final allocation code. Fold xfs_alloc_ag_vextent_type() into
  caller and factor out accounting.
- Fix up tracepoints.
v1: https://marc.info/?l=linux-xfs&m=155742169729590&w=2
- Continued development (various fixes, refinements) on generic bits and
  near mode implementation.
- Added patches 4-6 to refactor exact, by-size and small allocation
  modes.
rfcv2: https://marc.info/?l=linux-xfs&m=155197946630582&w=2
- Dropped spurious initial refactoring.
- Added minlen functionality.
- Properly tied into near alloc path.
- General refactoring and cleanups.
rfcv1: https://marc.info/?l=linux-xfs&m=154479089914351&w=2

--- filebench 200 thread file create test, 20k files, 7TB 32AG fs:

- baseline summary results (3 runs):

 3541: 34498.540: Run took 34495 seconds...
 3541: 34498.547: Per-Operation Breakdown
closefile1           19801ops        1ops/s   0.0mb/s      0.0ms/op      318us/op-cpu [0ms - 5ms]
fsync1               19801ops        1ops/s   0.0mb/s    883.0ms/op   408529us/op-cpu [0ms - 106670ms]
writefile1           19801ops        1ops/s 152.5mb/s 322687.8ms/op 80040329us/op-cpu [0ms - 10344864ms]
createfile1          20000ops        1ops/s   0.0mb/s      1.0ms/op     8460us/op-cpu [0ms - 593ms]
 3541: 34498.548: IO Summary: 79403 ops, 2.302 ops/s, (0/1 r/w), 152.5mb/s, 174211us cpu/op, 323571.9ms latency

94799: 34457.516: Run took 34454 seconds...
94799: 34457.524: Per-Operation Breakdown
closefile1           19801ops        1ops/s   0.0mb/s      0.1ms/op      266us/op-cpu [0ms - 329ms]
fsync1               19801ops        1ops/s   0.0mb/s   1014.5ms/op   394296us/op-cpu [0ms - 188565ms]
writefile1           19801ops        1ops/s 152.7mb/s 322228.1ms/op 80961984us/op-cpu [0ms - 10354419ms]
createfile1          20000ops        1ops/s   0.0mb/s      0.9ms/op    10845us/op-cpu [0ms - 944ms]
94799: 34457.524: IO Summary: 79403 ops, 2.304 ops/s, (0/1 r/w), 152.7mb/s, 174494us cpu/op, 323243.6ms latency

70863: 34440.504: Run took 34437 seconds...
70863: 34440.513: Per-Operation Breakdown
closefile1           19801ops        1ops/s   0.0mb/s      0.0ms/op      322us/op-cpu [0ms - 43ms]
fsync1               19801ops        1ops/s   0.0mb/s    909.0ms/op   412324us/op-cpu [0ms - 135906ms]
writefile1           19801ops        1ops/s 152.8mb/s 322198.4ms/op 82080900us/op-cpu [0ms - 10368105ms]
createfile1          20000ops        1ops/s   0.0mb/s      1.0ms/op     9544us/op-cpu [0ms - 605ms]
70863: 34440.513: IO Summary: 79403 ops, 2.306 ops/s, (0/1 r/w), 152.8mb/s, 174420us cpu/op, 323108.4ms latency

- test summary results:

22975: 34215.483: Run took 34212 seconds...
22975: 34215.491: Per-Operation Breakdown
closefile1           19801ops        1ops/s   0.0mb/s      0.0ms/op      272us/op-cpu [0ms - 7ms]
fsync1               19801ops        1ops/s   0.0mb/s    914.5ms/op   383151us/op-cpu [0ms - 176244ms]
writefile1           19801ops        1ops/s 153.8mb/s 319896.9ms/op 75523888us/op-cpu [0ms - 10337439ms]
createfile1          20000ops        1ops/s   0.0mb/s      1.1ms/op    12534us/op-cpu [0ms - 628ms]
22975: 34215.491: IO Summary: 79403 ops, 2.321 ops/s, (0/1 r/w), 153.8mb/s, 322858us cpu/op, 320812.6ms latency

 3954: 34197.494: Run took 34194 seconds...
 3954: 34197.503: Per-Operation Breakdown
closefile1           19801ops        1ops/s   0.0mb/s      0.1ms/op      261us/op-cpu [0ms - 549ms]
fsync1               19801ops        1ops/s   0.0mb/s    808.8ms/op   376216us/op-cpu [0ms - 107815ms]
writefile1           19801ops        1ops/s 153.8mb/s 319904.9ms/op 86184221us/op-cpu [0ms - 10315909ms]
createfile1          20000ops        1ops/s   0.0mb/s      1.0ms/op     5954us/op-cpu [0ms - 591ms]
 3954: 34197.503: IO Summary: 79403 ops, 2.322 ops/s, (0/1 r/w), 153.8mb/s, 172877us cpu/op, 320714.8ms latency

180845: 34295.481: Run took 34292 seconds...
180845: 34295.489: Per-Operation Breakdown
closefile1           19801ops        1ops/s   0.0mb/s      0.1ms/op      306us/op-cpu [0ms - 330ms]
fsync1               19801ops        1ops/s   0.0mb/s    868.4ms/op   387699us/op-cpu [0ms - 159923ms]
writefile1           19801ops        1ops/s 153.4mb/s 320814.7ms/op 84496457us/op-cpu [0ms - 10314107ms]
createfile1          20000ops        1ops/s   0.0mb/s      1.4ms/op     6386us/op-cpu [0ms - 425ms]
180845: 34295.489: IO Summary: 79403 ops, 2.315 ops/s, (0/1 r/w), 153.4mb/s, 174443us cpu/op, 321684.6ms latency

--- filebench 200 thread file create test, 5k files, 2TB 64AG fs
    (agsize: 8388608 4k blocks)

Locality measured in 512b sectors.

- baseline	- min: 8  max: 568752250 median: 434794.5 mean: 11446328
- test		- min: 33 max: 568402234 median: 437405.5 mean: 11752963
- by-size only	- min: 33 max: 568593146 median: 784805   mean: 11912300

Brian Foster (11):
  xfs: clean up small allocation helper
  xfs: move small allocation helper
  xfs: skip small alloc cntbt logic on NULL cursor
  xfs: always update params on small allocation
  xfs: track active state of allocation btree cursors
  xfs: use locality optimized cntbt lookups for near mode allocations
  xfs: refactor exact extent allocation mode
  xfs: refactor by-size extent allocation mode
  xfs: replace small allocation logic with agfl only logic
  xfs: refactor successful AG allocation accounting code
  xfs: condense high level AG allocation functions

 fs/xfs/libxfs/xfs_alloc.c       | 1485 +++++++++++++------------------
 fs/xfs/libxfs/xfs_alloc_btree.c |    1 +
 fs/xfs/libxfs/xfs_btree.h       |    3 +
 fs/xfs/xfs_trace.h              |   44 +-
 4 files changed, 643 insertions(+), 890 deletions(-)

Comments

Dave Chinner May 23, 2019, 1:56 a.m. UTC | #1
On Wed, May 22, 2019 at 02:05:35PM -0400, Brian Foster wrote:
> Hi all,
> 
> This is v2 of the extent allocation rework series. The changes in this
> version are mostly associated with code factoring, based on feedback to
> v1. The small mode helper refactoring has been isolated and pulled to
> the start of the series. The active flag that necessitated the btree
> cursor container structure has been pushed into the xfs_btree_cur
> private area. The resulting high level allocation code in
> xfs_ag_alloc_vextent() has been cleaned up to remove an unnecessary
> level of abstraction. Finally, there are various minor cleanups and
> fixes.
> 
> On the testing front, I've run a couple more filebench oriented tests
> since v1. The first is a high load, large filesystem, parallel file
> write+fsync test to try and determine whether the modified near mode
> allocation algorithm resulted in larger latencies in the common
> (non-fragmented) case. The results show comparable latencies, though the
> updated algorithm has a slightly faster overall runtime for whatever
> reason.

Probably indicative that over so many allocations, saving a few
microseconds of CPU time here and there adds up. That's also a fairly
good indication that the IO behaviour hasn't dramatically changed
between algorithms - we're not adding or removing a huge number of
seeks to the workload....

> The second is another filebench test (but with a smaller fileset against
> a smaller filesystem), but with the purpose of measuring "locality
> effectiveness" of the updated algorithm via post-test analysis of the
> resulting/populated filesystem. I've been thinking a bit about how to
> test for locality since starting on this series and ultimately came up
> with the following, fairly crude heuristic: track and compare the worst
> locality allocation for each regular file inode in the fs.

OK, that's pretty crude :P

> This
> essentially locates the most distant extent for each inode, tracks the
> delta from that extent to the inode location on disk and calculates the
> average worst case delta across the entire set of regular files. The
> results show that the updated algorithm provides a comparable level of
> locality to the existing algorithm.

The problem with this is that worse case locality isn't a
particularly useful measure. In general, when you have allocator
contention it occurs on the AGF locks and so the allocator skips to
the next AG it can lock. That means if we have 32 AGs and 33
allocations in progress at once, the AG that it chosen for
allocation is going to be essentially random. This means worst case
allocation locality is always going to be "ag skip" distances and so
the jumps between AGs are going to largely dominate the measured
locality distances.

In this case, 7TB, 32AGs = ~220GB per AG, so an AG skip will be
around 220 * 2^30 / 2^9 = ~460m sectors and:

> - baseline	- min: 8  max: 568752250 median: 434794.5 mean: 11446328
> - test		- min: 33 max: 568402234 median: 437405.5 mean: 11752963
> - by-size only	- min: 33 max: 568593146 median: 784805   mean: 11912300

max are all >460m sectors and so are AG skip distances.

However, the changes you've made affect locality for allocations
_within_ an AG, not across the filesystem, and so anything that
skips to another AG really needs to be measured differently.

i.e. what we really need to measure here is "how close to target did
we get?" and for extending writes the target is always the AGBNO of
the end of the last extent.

The inode itself is only used as the target for the first extent, so
using it as the only distance comparison ignores the fact we try to
allocate as close to the end of the last extent as possible, not as
close to the inode as possible. Hence once a file has jumped AG, it
will stay in the new AG and not return to the original AG the inode
is in. This means that once the file data changes locality, it tries
to keep that same locality for the next data that is written, not
force another seek back to the original location.

So, AFAICT, the measure of locality we should be using to evaluate
the impact to locality of the new algorithm is the distance between
sequential extents in a file allocated within the same AG, not the
worst case distance from the inode....

Cheers,

Dave.

(*) Which, in reality, we really should reset because once we jump
AG we have no locality target and so should allow the full AG to be
considered. This "didn't reset target" issue is something I suspect
leads to the infamous "backwards allocation for sequential writes"
problems...
Brian Foster May 23, 2019, 12:55 p.m. UTC | #2
On Thu, May 23, 2019 at 11:56:59AM +1000, Dave Chinner wrote:
> On Wed, May 22, 2019 at 02:05:35PM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > This is v2 of the extent allocation rework series. The changes in this
> > version are mostly associated with code factoring, based on feedback to
> > v1. The small mode helper refactoring has been isolated and pulled to
> > the start of the series. The active flag that necessitated the btree
> > cursor container structure has been pushed into the xfs_btree_cur
> > private area. The resulting high level allocation code in
> > xfs_ag_alloc_vextent() has been cleaned up to remove an unnecessary
> > level of abstraction. Finally, there are various minor cleanups and
> > fixes.
> > 
> > On the testing front, I've run a couple more filebench oriented tests
> > since v1. The first is a high load, large filesystem, parallel file
> > write+fsync test to try and determine whether the modified near mode
> > allocation algorithm resulted in larger latencies in the common
> > (non-fragmented) case. The results show comparable latencies, though the
> > updated algorithm has a slightly faster overall runtime for whatever
> > reason.
> 
> Probably indicative that over so many allocations, saving a few
> microseconds of CPU time here and there adds up. That's also a fairly
> good indication that the IO behaviour hasn't dramatically changed
> between algorithms - we're not adding or removing a huge number of
> seeks to the workload....
> 

Makes sense. The goal here (and purpose for the higher level testing) is
basically to confirm this doesn't break/regress the common
(non-fragmented) allocation scenario. I suppose that's a bonus if we
find some incremental speedups along the way..

> > The second is another filebench test (but with a smaller fileset against
> > a smaller filesystem), but with the purpose of measuring "locality
> > effectiveness" of the updated algorithm via post-test analysis of the
> > resulting/populated filesystem. I've been thinking a bit about how to
> > test for locality since starting on this series and ultimately came up
> > with the following, fairly crude heuristic: track and compare the worst
> > locality allocation for each regular file inode in the fs.
> 
> OK, that's pretty crude :P
> 

Yeah, this was just a start and I figured it might generate some
feedback... ;)

> > This
> > essentially locates the most distant extent for each inode, tracks the
> > delta from that extent to the inode location on disk and calculates the
> > average worst case delta across the entire set of regular files. The
> > results show that the updated algorithm provides a comparable level of
> > locality to the existing algorithm.
> 
> The problem with this is that worse case locality isn't a
> particularly useful measure. In general, when you have allocator
> contention it occurs on the AGF locks and so the allocator skips to
> the next AG it can lock. That means if we have 32 AGs and 33
> allocations in progress at once, the AG that it chosen for
> allocation is going to be essentially random. This means worst case
> allocation locality is always going to be "ag skip" distances and so
> the jumps between AGs are going to largely dominate the measured
> locality distances.
> 

Good point. I was thinking about rerunning this with agcount=1 (with a
lighter workload) to isolate the analysis to a single AG, but wanted to
get this on the list for feedback given the time it takes populate the
fs and whatnot.

> In this case, 7TB, 32AGs = ~220GB per AG, so an AG skip will be
> around 220 * 2^30 / 2^9 = ~460m sectors and:
> 
> > - baseline	- min: 8  max: 568752250 median: 434794.5 mean: 11446328
> > - test		- min: 33 max: 568402234 median: 437405.5 mean: 11752963
> > - by-size only	- min: 33 max: 568593146 median: 784805   mean: 11912300
> 
> max are all >460m sectors and so are AG skip distances.
> 

Though it is interesting that the average and median are within that
~460m sector delta. I think that means this is at least catching some
information on intra-AG locality as many of these files might not have
had to jump AGs. Taking a look at the dataset, this could be because the
majority of these files are on the smaller side and consist of one or
two extents. There's also plenty of RAM (64GB) on the box I happened to
use.

For reference, the file sizes in the set are defined by the following
filebench mapping:

{{50, 4k, 1m},
 {35, 10m, 100m},
 {10, 500m, 1g},
 {5, 1g, 8g}
}

... where the first value is the percentage and the next two are a size
range. E.g., 50% of files are 4k-1m, 35% are 10m-100m, etc. I can
obviously tweak this however necessary to provide the most useful results
or test different distributions.
test different distributions.

But while I don't think the current number is completely bogus, I agree
that it's diluted by those larger files that do happen to jump AGs and
thus it is probably missing critical information about file extension
allocation locality.

> However, the changes you've made affect locality for allocations
> _within_ an AG, not across the filesystem, and so anything that
> skips to another AG really needs to be measured differently.
> 
> i.e. what we really need to measure here is "how close to target did
> we get?" and for extending writes the target is always the AGBNO of
> the end of the last extent.
> 
> The inode itself is only used as the target for the first extent, so
> using it as the only distance comparison ignores the fact we try to
> allocate as close to the end of the last extent as possible, not as
> close to the inode as possible. Hence once a file has jumped AG, it
> will stay in the new AG and not return to the original AG the inode
> is in. This means that once the file data changes locality, it tries
> to keep that same locality for the next data that is written, not
> force another seek back to the original location.
> 

Ok, I was thinking that locality is always based on inode location. I
see that xfs_bmap_btalloc() assigns ap->blkno (which makes its way to
args.fsbno/agbno) based on the inode, but I missed the
xfs_bmap_adjacent() call right after that which overrides ap->blkno
based on the previous extent, if applicable.

So this means that the worst case locality based on inode location is
actually invalid for (sequentially written) multi-extent files because
once we create a discontiguity, we're not measuring against the locality
target that was actually used by the algorithm (even within the same
AG).
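
To restate that in code, the effective locality target works roughly
like the following (a simplified model of the behavior described above,
not the actual kernel code):

/* hypothetical model of data allocation target selection */
#include <stdbool.h>
#include <stdint.h>

struct target_info {
	uint64_t ino_fsbno;	/* inode location (fs block) */
	bool have_prev;		/* does a previous extent exist? */
	uint64_t prev_start;	/* previous extent start */
	uint64_t prev_len;	/* previous extent length */
};

/* target the end of the previous extent if one exists, else the inode */
static uint64_t locality_target(const struct target_info *t)
{
	if (t->have_prev)
		return t->prev_start + t->prev_len;
	return t->ino_fsbno;
}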

> So, AFAICT, the measure of locality we should be using to evaluate
> the impact to locality of the new algorithm is the distance between
> sequential extents in a file allocated within the same AG, not the
> worst case distance from the inode....
> 

Yeah, makes sense. I created metadumps of each filesystem created for
this test in anticipation of tweaks to the post-processing heuristic.

I assume we still want to fold this measurement up into some mean/median
locality value for the overall filesystem for comparison purposes, but
how would you prefer to see that tracked on a per-inode basis? I could
still track the worst case stride from one extent to the next within an
inode (provided they sit in the same AG), or we could do something like
track the average stride for each inode, or average stride in general
across all intra-AG extents. Hmmm.. I suppose if I had a script that
just dumped every applicable stride/delta value for an inode, I could
dump all of those numbers into a file and we can process it from there..

I'll play around with this some more. Thanks for the feedback!

> Cheers,
> 
> Dave.
> 
> (*) Which, in reality, we really should reset because once we jump
> AG we have no locality target and so should allow the full AG to be
> considered. This "didn't reset target" issue is something I suspect
> leads to the infamous "backwards allocation for sequential writes"
> problems...
> 

I think this is something that's handled at a higher level. In the
nullfb case at least, we use the XFS_ALLOCTYPE_START_BNO allocation mode
which is what allows us to iterate AGs. We start with a near mode
allocation in the target AG. If that fails, xfs_alloc_vextent() switches
over to XFS_ALLOCTYPE_THIS_AG for the remaining AGs. This lands in
xfs_alloc_ag_vextent_size() which ignores args.agbno even if it's set.
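
Roughly, that iteration behaves like this sketch (illustrative only; the
helpers are made up stand-ins for the near and by-size allocators):

/* hypothetical model of the XFS_ALLOCTYPE_START_BNO AG iteration */
struct xfs_alloc_arg;
extern int alloc_near(struct xfs_alloc_arg *args, int agno);
extern int alloc_by_size(struct xfs_alloc_arg *args, int agno);

static int start_bno_alloc(struct xfs_alloc_arg *args, int agcount,
			   int target_agno)
{
	/* near mode allocation in the target AG first... */
	if (alloc_near(args, target_agno) == 0)
		return 0;

	/* ...then walk the remaining AGs with by-size allocations */
	for (int i = 1; i < agcount; i++) {
		int agno = (target_agno + i) % agcount;

		/* the by-size path ignores the agbno locality target */
		if (alloc_by_size(args, agno) == 0)
			return 0;
	}
	return -1;	/* out of space */
}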

Brian

> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner May 23, 2019, 10:15 p.m. UTC | #3
On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote:
> On Thu, May 23, 2019 at 11:56:59AM +1000, Dave Chinner wrote:
> > On Wed, May 22, 2019 at 02:05:35PM -0400, Brian Foster wrote:
> > > Hi all,
> > > 
> > > This is v2 of the extent allocation rework series. The changes in this
> > > version are mostly associated with code factoring, based on feedback to
> > > v1. The small mode helper refactoring has been isolated and pulled to
> > > the start of the series. The active flag that necessitated the btree
> > > cursor container structure has been pushed into the xfs_btree_cur
> > > private area. The resulting high level allocation code in
> > > xfs_ag_alloc_vextent() has been cleaned up to remove an unnecessary
> > > level of abstraction. Finally, there are various minor cleanups and
> > > fixes.
> > > 
> > > On the testing front, I've run a couple more filebench oriented tests
> > > since v1. The first is a high load, large filesystem, parallel file
> > > write+fsync test to try and determine whether the modified near mode
> > > allocation algorithm resulted in larger latencies in the common
> > > (non-fragmented) case. The results show comparable latencies, though the
> > > updated algorithm has a slightly faster overall runtime for whatever
> > > reason.
> > 
> > Probably indicative that over so many allocations, saving a few
> > microseconds of CPU time here and there adds up. That's also a fairly
> > good indication that the IO behaviour hasn't dramatically changed
> > between algorithms - we're not adding or removing a huge number of
> > seeks to the workload....
> > 
> 
> Makes sense. The goal here (and purpose for the higher level testing) is
> basically to confirm this doesn't break/regress the common
> (non-fragmented) allocation scenario. I suppose that's a bonus if we
> find some incremental speedups along the way..

*nod*

[snip]


> > In this case, 7TB, 32AGs = ~220GB per AG, so an AG skip will be
> > around 220 * 2^30 / 2^9 = ~460m sectors and:
> > 
> > > - baseline	- min: 8  max: 568752250 median: 434794.5 mean: 11446328
> > > - test		- min: 33 max: 568402234 median: 437405.5 mean: 11752963
> > > - by-size only	- min: 33 max: 568593146 median: 784805   mean: 11912300
> > 
> > max are all >460m sectors and so are AG skip distances.
> > 
> 
> Though it is interesting that the average and median are within that
> ~460m sector delta. I think that means this is at least catching some
> information on intra-AG locality as many of these files might not have
> had to jump AGs.

[....]

> But while I don't think the current number is completely bogus, I agree
> that it's diluted by those larger files that do happen to jump AGs and
> thus it is probably missing critical information about file extension
> allocation locality.

*nod*

[....]

> > So, AFAICT, the measure of locality we should be using to evaluate
> > the impact to locality of the new algorithm is the distance between
> > sequential extents in a file allocated within the same AG, not the
> > worst case distance from the inode....
> > 
> 
> Yeah, makes sense. I created metadumps of each filesystem created for
> this test in anticipation of tweaks to the post-processing heuristic.

Nice :)

> I assume we still want to fold this measurement up into some mean/median
> locality value for the overall filesystem for comparison purposes, but
> how would you prefer to see that tracked on a per-inode basis? I could
> still track the worst case stride from one extent to the next within an
> inode (provided they sit in the same AG), or we could do something like
> track the average stride for each inode, or average stride in general
> across all intra-AG extents.

Histograms are probably the best way to demonstrate the general
distribution of the data set. Average/median don't really convey
much other than whether the dataset is skewed to one end or another,
while histograms will give us a good indication in the change of
"distance from target" over a wide range of the filesystem. It can
even incorporate "skipped AG" as a bucket to indicate how often
that is occurring and whether the change in algorithms
increases/decreases the occurrence of that.

FWIW, I suspect a histogram that describes "extent size vs locality
of next extent" will give us an idea of whether small or large
allocations have worse locality. I'd also expect a measure like this
to give insight into how badly free space fragmentation is affecting
the filesystem, as it will tend to push the extent size distribution
towards the smaller end of the scale....
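
The bucketing itself can be trivial, e.g. power-of-two distance buckets
with a separate cross-AG bucket (a sketch, not existing tooling):

/* sketch: power-of-two histogram of "distance from target" samples */
#include <stdint.h>

#define NBUCKETS	32

static uint64_t buckets[NBUCKETS];
static uint64_t ag_skips;	/* cross-AG allocations counted separately */

static void bucket_sample(uint64_t dist, int skipped_ag)
{
	int b = 0;

	if (skipped_ag) {
		ag_skips++;
		return;
	}
	while (dist > 1 && b < NBUCKETS - 1) {
		dist >>= 1;
		b++;
	}
	buckets[b]++;
}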

> Hmmm.. I suppose if I had a script that
> just dumped every applicable stride/delta value for an inode, I could
> dump all of those numbers into a file and we can process it from there..

See how the freesp commands work in xfs_db - they just generate a
set of {offset, size} tuples that are then bucketted appropriately.
This is probably the best way to do this at the moment - have xfs_db
walk the inode BMBTs outputting something like {extent size,
distance to next extent} tuples and everything else falls out from
how we bucket that information.
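
i.e. per inode the walk emits one {extent size, distance to next
extent} tuple per adjacent extent pair, along the lines of this sketch
(hypothetical types, just to illustrate the output):

/* sketch: emit {extent size, distance to next extent} tuples */
#include <stdint.h>
#include <stdio.h>

struct bmbt_ext {
	uint64_t start;		/* extent start, fs blocks */
	uint64_t len;		/* extent length, fs blocks */
};

static void emit_tuples(const struct bmbt_ext *e, int n)
{
	for (int i = 0; i < n - 1; i++) {
		uint64_t end = e[i].start + e[i].len;
		uint64_t dist = e[i + 1].start > end ?
				e[i + 1].start - end : end - e[i + 1].start;

		printf("%llu %llu\n", (unsigned long long)e[i].len,
		       (unsigned long long)dist);
	}
}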


> > (*) Which, in reality, we really should reset because once we jump
> > AG we have no locality target and so should allow the full AG to be
> > considered. This "didn't reset target" issue is something I suspect
> > leads to the infamous "backwards allocation for sequential writes"
> > problems...
> 
> I think this is something that's handled at a higher level. In the
> nullfb case at least, we use the XFS_ALLOCTYPE_START_BNO allocation mode
> which is what allows us to iterate AGs.

The nullfb case isn't the problem here - I think the problem comes
when we fail the THIS_BNO/NEAR_BNO allocation attempts in
xfs_bmap_btalloc() and then fall back to XFS_ALLOCTYPE_START_BNO
with the same fsbno target.

> We start with a near mode
> allocation in the target AG. If that fails, xfs_alloc_vextent() switches
> over to XFS_ALLOCTYPE_THIS_AG for the remaining AGs. This lands in
> xfs_alloc_ag_vextent_size() which ignores args.agbno even if it's set.

Hmmm. I think you just pointed out a bug in this code. The initial
NEARBNO alloc sets ->min_agbno and ->max_agbno based on the
args->agbno, which xfs_alloc_compute_aligned then uses to determine
if the found free extent can be used. When we immediately fall back
to XFS_ALLOCTYPE_THIS_AG allocation, we don't reset
min_agbno/max_agbno, and so even though we ignore the args->agbno
target that is set when selecting a free extent, we still call
xfs_alloc_compute_aligned() to determine if it is appropriate to
use.

I think the bug here is that xfs_alloc_compute_aligned() applies the
NEARBNO args->min_agbno to the free extent found by the THIS_AG
allocation, and so if the large free extent in the new AG overlaps
min_agbno it selects the higher up part of the free extent that
starts at min_agbno, rather than using the start of the free
extent...

Of course, I could be missing something obvious (still haven't
finished my morning coffee!), so it's worth checking I read the code
properly....

Cheers,

Dave.
Brian Foster May 24, 2019, noon UTC | #4
On Fri, May 24, 2019 at 08:15:52AM +1000, Dave Chinner wrote:
> On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote:
> > On Thu, May 23, 2019 at 11:56:59AM +1000, Dave Chinner wrote:
> > > On Wed, May 22, 2019 at 02:05:35PM -0400, Brian Foster wrote:
> > > > Hi all,
> > > > 
> > > > This is v2 of the extent allocation rework series. The changes in this
> > > > version are mostly associated with code factoring, based on feedback to
> > > > v1. The small mode helper refactoring has been isolated and pulled to
> > > > the start of the series. The active flag that necessitated the btree
> > > > cursor container structure has been pushed into the xfs_btree_cur
> > > > private area. The resulting high level allocation code in
> > > > xfs_ag_alloc_vextent() has been cleaned up to remove an unnecessary
> > > > level of abstraction. Finally, there are various minor cleanups and
> > > > fixes.
> > > > 
> > > > On the testing front, I've run a couple more filebench oriented tests
> > > > since v1. The first is a high load, large filesystem, parallel file
> > > > write+fsync test to try and determine whether the modified near mode
> > > > allocation algorithm resulted in larger latencies in the common
> > > > (non-fragmented) case. The results show comparable latencies, though the
> > > > updated algorithm has a slightly faster overall runtime for whatever
> > > > reason.
> > > 
> > > Probably indicative that over so many allocations, saving a few
> > > microseconds of CPU time here and there adds up. That's also a fairly
> > > good indication that the IO behaviour hasn't dramatically changed
> > > between algorithms - we're not adding or removing a huge number of
> > > seeks to the workload....
> > > 
> > 
> > Makes sense. The goal here (and purpose for the higher level testing) is
> > basically to confirm this doesn't break/regress the common
> > (non-fragmented) allocation scenario. I suppose that's a bonus if we
> > find some incremental speedups along the way..
> 
> *nod*
> 
> [snip]
> 
> 
> > > In this case, 7TB, 32AGs = ~220GB per AG, so an AG skip will be
> > > around 220 * 2^30 / 2^9 = ~460m sectors and:
> > > 
> > > > - baseline	- min: 8  max: 568752250 median: 434794.5 mean: 11446328
> > > > - test		- min: 33 max: 568402234 median: 437405.5 mean: 11752963
> > > > - by-size only	- min: 33 max: 568593146 median: 784805   mean: 11912300
> > > 
> > > max are all >460m sectors and so are AG skip distances.
> > > 
> > 
> > Though it is interesting that the average and median are within that
> > ~460m sector delta. I think that means this is at least catching some
> > information on intra-AG locality as many of these files might not have
> > had to jump AGs.
> 
> [....]
> 
> > But while I don't think the current number is completely bogus, I agree
> > that it's diluted by those larger files that do happen to jump AGs and
> > thus it is probably missing critical information about file extension
> > allocation locality.
> 
> *nod*
> 
> [....]
> 
> > > So, AFAICT, the measure of locality we should be using to evaluate
> > > the impact to locality of the new algorithm is the distance between
> > > sequential extents in a file allocated within the same AG, not the
> > > worst case distance from the inode....
> > > 
> > 
> > Yeah, makes sense. I created metadumps of each filesystem created for
> > this test in anticipation of tweaks to the post-processing heuristic.
> 
> Nice :)
> 
> > I assume we still want to fold this measurement up into some mean/median
> > locality value for the overall filesystem for comparison purposes, but
> > how would you prefer to see that tracked on a per-inode basis? I could
> > still track the worst case stride from one extent to the next within an
> > inode (provided they sit in the same AG), or we could do something like
> > track the average stride for each inode, or average stride in general
> > across all intra-AG extents.
> 
> Histograms are probably the best way to demonstrate the general
> distribution of the data set. Average/median don't really convey
> much other than whether the dataset is skewed to one end or another,
> while histograms will give us a good indication in the change of
> "distance from target" over a wide range of the filesystem. It can
> even incorporate "skipped AG" as a bucket to indicate how often
> that is occurring and whether the change in algorithms
> increases/decreases the occurrence of that.
> 
> FWIW, I suspect a histogram that describes "extent size vs locality
> of next extent" will give us an idea of whether small or large
> allocations have worse locality. I'd also expect a measure like this
> to give insight into how badly free space fragmentation is affecting
> the filesystem, as it will tend to push the extent size distribution
> towards the smaller end of the scale....
> 

Ok. I actually attempted a histogram of the current dataset but the
random CLI utility I found didn't quite spit out what I expected. I was
probably using it wrong, but didn't dig much further into it..

> > Hmmm.. I suppose if I had a script that
> > just dumped every applicable stride/delta value for an inode, I could
> > dump all of those numbers into a file and we can process it from there..
> 
> See how the freesp commands work in xfs_db - they just generate a
> set of {offset, size} tuples that are then bucketted appropriately.
> This is probably the best way to do this at the moment - have xfs_db
> walk the inode BMBTs outputting something like {extent size,
> distance to next extent} tuples and everything else falls out from
> how we bucket that information.
> 

That sounds plausible. A bit more involved than what I'm currently
working with, but we do already have a blueprint for the scanning
implementation required to collect this data via the frag command.
Perhaps some of this code between the frag/freesp can be generalized and
reused. I'll take a closer look at it.

My only concern is I'd prefer to only go down this path as long as we
plan to land the associated command in xfs_db. So this approach suggests
to me that we add a "locality" command similar to frag/freesp that
presents the locality state of the fs. For now I'm only really concerned
with the data associated with known near mode allocations (i.e., such as
the extent size:distance relationship you've outlined above) so we can
evaluate these algorithmic changes, but this would be for fs devs only
so we could always expand on it down the road if we want to assess
different allocations. Hm?

> 
> > > (*) Which, in reality, we really should reset because once we jump
> > > AG we have no locality target and so should allow the full AG to be
> > > considered. This "didn't reset target" issue is something I suspect
> > > leads to the infamous "backwards allocation for sequential writes"
> > > problems...
> > 
> > I think this is something that's handled at a higher level. In the
> > nullfb case at least, we use the XFS_ALLOCTYPE_START_BNO allocation mode
> > which is what allows us to iterate AGs.
> 
> The nullfb case isn't the problem here - I think the problem comes
> when we fail the THIS_BNO/NEAR_BNO allocation attempts in
> xfs_bmap_btalloc() and then fall back to XFS_ALLOCTYPE_START_BNO
> with the same fsbno target.
> 

Ok, that's the retry where we restore minlen from maxlen back to 1. Note
that this still depends on nullfb, but that's just context. I think the
majority of the time data allocations come into the allocator as the
first allocation in the transaction.

That aside, I've also seen that this particular retry is hotter than
you'd think just by glancing at the code due to the whole minlen =
maxlen thing in xfs_bmap_select_minlen() combined with potential size of
delalloc extents. See the patches I posted a couple weeks ago in this
area for example:

  https://marc.info/?l=linux-xfs&m=155671950608062&w=2

... though I don't think they relate to the issue you're tracking here.

> > We start with a near mode
> > allocation in the target AG. If that fails, xfs_alloc_vextent() switches
> > over to XFS_ALLOCTYPE_THIS_AG for the remaining AGs. This lands in
> > xfs_alloc_ag_vextent_size() which ignores args.agbno even if it's set.
> 
> Hmmm. I think you just pointed out a bug in this code. The initial
> NEARBNO alloc sets ->min_agbno and ->max_agbno based on the
> args->agbno, which xfs_alloc_compute_aligned then uses to determine
> if the found free extent can be used. When we immediately fall back
> to XFS_ALLOCTYPE_THIS_AG allocation, we don't reset
> min_agbno/max_agbno, and so even though we ignore the args->agbno
> target that is set when selecting a free extent, we still call
> xfs_alloc_compute_aligned() to determine if it is appropriate to
> use.
> 

I don't see where the near mode alloc assigns ->[min|max]_agbno. The
only place I see ->min_agbno used is for sparse inode chunk allocation.
->max_agbno is assigned in the near mode alloc when min == 0, but it's
set to agsize so I'm guessing that this is just an initialization so the
checking logic works with zeroed out values from the caller.

Just to clarify.. what exactly is the reverse allocation problem you're
thinking about here? Logically increasing writes resulting in physically
decreasing allocations? If so, have we confirmed that writes are indeed
sequential when this occurs (obvious I guess, but worth asking).

I'm not really familiar with this problem so I've not thought about it,
but one random thought comes to mind: is there any chance of allocation
interlocking behind extent frees contributing to this behavior? We do
truncate files in reverse order an extent at a time and immediately
finish the dfops (to free the extent and busy it) before we unmap the
next piece. If allocations could interlock with such frees (or busy
clearing), I could see in theory how that might result in backwards
allocations, but I'm just handwaving and not sure that's possible in
practice.

Brian

> I think the bug here is that xfs_alloc_compute_aligned() applies the
> NEARBNO args->min_agbno to the free extent found by the THIS_AG
> allocation, and so if the large free extent in the new AG overlaps
> min_agbno it selects the higher up part of the free extent that
> starts at min_agbno, rather than using the start of the free
> extent...
> 
> Of course, I could be missing something obvious (still haven't
> finished my morning coffee!), so it's worth checking I read the code
> properly....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner May 25, 2019, 10:43 p.m. UTC | #5
On Fri, May 24, 2019 at 08:00:18AM -0400, Brian Foster wrote:
> On Fri, May 24, 2019 at 08:15:52AM +1000, Dave Chinner wrote:
> > On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote:
> > > Hmmm.. I suppose if I had a script that
> > > just dumped every applicable stride/delta value for an inode, I could
> > > dump all of those numbers into a file and we can process it from there..
> > 
> > See how the freesp commands work in xfs_db - they just generate a
> > set of {offset, size} tuples that are then bucketted appropriately.
> > This is probably the best way to do this at the moment - have xfs_db
> > walk the inode BMBTs outputting something like {extent size,
> > distance to next extent} tuples and everything else falls out from
> > how we bucket that information.
> > 
> 
> That sounds plausible. A bit more involved than what I'm currently
> working with, but we do already have a blueprint for the scanning
> implementation required to collect this data via the frag command.
> Perhaps some of this code between the frag/freesp can be generalized and
> reused. I'll take a closer look at it.
> 
> My only concern is I'd prefer to only go down this path as long as we
> plan to land the associated command in xfs_db. So this approach suggests
> to me that we add a "locality" command similar to frag/freesp that
> presents the locality state of the fs. For now I'm only really concerned
> with the data associated with known near mode allocations (i.e., such as
> the extent size:distance relationship you've outlined above) so we can
> evaluate these algorithmic changes, but this would be for fs devs only
> so we could always expand on it down the road if we want to assess
> different allocations. Hm?

Yup, I'm needing to do similar analysis myself to determine how
quickly I'm aging the filesystem, so having the tool in xfs_db or
xfs_spaceman would be very useful.

FWIW, the tool I've just started writing will just use fallocate and
truncate to hammer the allocation code as hard and as quickly as
possible - I want to do accelerated aging of the filesystem, and so
being able to run tens to hundreds of thousands of free space
manipulations a second is the goal here....

> > > > (*) Which, in reality, we really should reset because once we jump
> > > > AG we have no locality target and so should allow the full AG to be
> > > > considered. This "didn't reset target" issue is something I suspect
> > > > leads to the infamous "backwards allocation for sequential writes"
> > > > problems...
> > > 
> > > I think this is something that's handled at a higher level. In the
> > > nullfb case at least, we use the XFS_ALLOCTYPE_START_BNO allocation mode
> > > which is what allows us to iterate AGs.
> > 
> > The nullfb case isn't the problem here - I think the problem comes
> > when we fail the THIS_BNO/NEAR_BNO allocation attempts in
> > xfs_bmap_btalloc() and then fall back to XFS_ALLOCTYPE_START_BNO
> > with the same fsbno target.
> > 
> 
> Ok, that's the retry where we restore minlen from maxlen back to 1. Note
> that this still depends on nullfb, but that's just context. I think the
> majority of the time data allocations come into the allocator as the
> first allocation in the transaction.
> 
> That aside, I've also seen that this particular retry is hotter than
> you'd think just by glancing at the code due to the whole minlen =
> maxlen thing in xfs_bmap_select_minlen() combined with potential size of
> delalloc extents. See the patches I posted a couple weeks ago in this
> area for example:
> 
>   https://marc.info/?l=linux-xfs&m=155671950608062&w=2
> 
> ... though I don't think they relate to the issue you're tracking here.
> 
> > > We start with a near mode
> > > allocation in the target AG. If that fails, xfs_alloc_vextent() switches
> > > over to XFS_ALLOCTYPE_THIS_AG for the remaining AGs. This lands in
> > > xfs_alloc_ag_vextent_size() which ignores args.agbno even if it's set.
> > 
> > Hmmm. I think you just pointed out a bug in this code. The initial
> > NEARBNO alloc sets ->min_agbno and ->max_agbno based on the
> > args->agbno, which xfs_alloc_compute_aligned then uses to determine
> > if the found free extent can be used. When we immediately fall back
> > to XFS_ALLOCTYPE_THIS_AG allocation, we don't reset
> > min_agbno/max_agbno, and so even though we ignore the args->agbno
> > target that is set when selecting a free extent, we still call
> > xfs_alloc_compute_aligned() to determine if it is appropriate to
> > use.
> > 
> 
> I don't see where the near mode alloc assigns ->[min|max]_agbno. The
> only place I see ->min_agbno used is for sparse inode chunk allocation.
> ->max_agbno is assigned in the near mode alloc when min == 0, but it's
> set to agsize so I'm guessing that this is just an initialization so the
> checking logic works with zeroed out values from the caller.

I did say:

> > Of course, I could be missing something obvious (still haven't
> > finished my morning coffee!), so it's worth checking I read the code
> > properly....

And that's clearly the case here - I read the code that clamps agbno
to min/max backwards....

> Just to clarify.. what exactly is the reverse allocation problem you're
> thinking about here? Logically increasing writes resulting in physically
> decreasing allocations? If so, have we confirmed that writes are indeed
> sequential when this occurs (obvious I guess, but worth asking).

Precisely this. And, yes, I've confirmed many times that it is
sequential writes that display it. It happens when the nearest free
extent is below the current target, i.e.:


+FFFFFFFFFFFFF+aaaaaaaT+bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb+FFFFF+

We just allocated "a" so the next allocation target is T, but "b" is
already allocated there. The freespace before "a" is larger than
the free space after "b" (which may be less than maxlen), and so the
"nearest best free space" is below the target. If "b" is large
enough, we end up with file allocation that looks like (where n =
relative file offset for the given allocation)

     +10+9+8+7+6+5+4+1+2+3+

i.e. allocation starts forward, uses the remaining ascending free
space in the extent, then starts allocating backwards as it takes
the nearest free space and trims off the top of the free extent for
the allocation.

> I'm not really familiar with this problem so I've not thought about it,
> but one random thought comes to mind: is there any chance of allocation
> interlocking behind extent frees contributing to this behavior? We do
> truncate files in reverse order an extent at a time and immediately
> finish the dfops (to free the extent and busy it) before we unmap the
> next piece.

Unlikely, because such extents are not immediately available for
allocation (they are busy). And I've seen it on pure contended
sequential write workloads, too...

> If allocations could interlock with such frees (or busy
> clearing), I could see in theory how that might result in backwards
> allocations, but I'm just handwaving and not sure that's possible in
> practice.

Most of the cases I've seen have had the same symptom - "skip to
next AG, allocate at same high-up-in AGBNO target as the previous AG
wanted, then allocate backwards in the same AG until freespace
extent is exhausted. It then skips to some other freespace extent,
and depending on whether it's a forwards or backwards skip the
problem either goes away or continues". This is not a new behaviour,
I first saw it some 15 years ago, but I've never been able to
provoke it reliably enough with test code to get to the root
cause...

Cheers,

Dave.
Brian Foster May 31, 2019, 5:11 p.m. UTC | #6
On Sun, May 26, 2019 at 08:43:17AM +1000, Dave Chinner wrote:
> On Fri, May 24, 2019 at 08:00:18AM -0400, Brian Foster wrote:
> > On Fri, May 24, 2019 at 08:15:52AM +1000, Dave Chinner wrote:
> > > On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote:
> > > > Hmmm.. I suppose if I had a script that
> > > > just dumped every applicable stride/delta value for an inode, I could
> > > > dump all of those numbers into a file and we can process it from there..
> > > 
> > > See how the freesp commands work in xfs_db - they just generate a
> > > set of {offset, size} tuples that are then bucketted appropriately.
> > > This is probably the best way to do this at the moment - have xfs_db
> > > walk the inode BMBTs outputting something like {extent size,
> > > distance to next extent} tuples and everything else falls out from
> > > how we bucket that information.
> > > 
> > 
> > That sounds plausible. A bit more involved than what I'm currently
> > working with, but we do already have a blueprint for the scanning
> > implementation required to collect this data via the frag command.
> > Perhaps some of this code between the frag/freesp can be generalized and
> > reused. I'll take a closer look at it.
> > 
> > My only concern is I'd prefer to only go down this path as long as we
> > plan to land the associated command in xfs_db. So this approach suggests
> > to me that we add a "locality" command similar to frag/freesp that
> > presents the locality state of the fs. For now I'm only really concerned
> > with the data associated with known near mode allocations (i.e., such as
> > the extent size:distance relationship you've outlined above) so we can
> > evaluate these algorithmic changes, but this would be for fs devs only
> > so we could always expand on it down the road if we want to assess
> > different allocations. Hm?
> 
> Yup, I'm needing to do similar analysis myself to determine how
> quickly I'm aging the filesystem, so having the tool in xfs_db or
> xfs_spaceman would be very useful.
> 
> FWIW, the tool I've just started writing will just use fallocate and
> truncate to hammer the allocation code as hard and as quickly as
> possible - I want to do accelerated aging of the filesystem, and so
> being able to run tens to hundreds of thousands of free space
> manipulations a second is the goal here....
> 

Ok. FWIW, from playing with this so far (before getting distracted for
much of this week) the most straightforward place to add this kind of
functionality turns out to be the frag command itself. It does 99% of
the work required to process data extents already, including pulling the
on-disk records of each inode in-core for processing. I basically just
had to update that code to include all of the record data and add the
locality tracking logic (I haven't got to actually presenting it yet..).

> > > > > (*) Which, in reality, we really should reset because once we jump
> > > > > AG we have no locality target and so should allow the full AG to be
> > > > > considered. This "didn't reset target" issue is something I suspect
> > > > > leads to the infamous "backwards allocation for sequential writes"
> > > > > problems...
> > > > 
> > > > I think this is something that's handled at a higher level. In the
> > > > nullfb case at least, we use the XFS_ALLOCTYPE_START_BNO allocation mode
> > > > which is what allows us to iterate AGs.
> > > 
> > > The nullfb case isn't the problem here - I think the problem comes
> > > when we fail the THIS_BNO/NEAR_BNO allocation attempts in
> > > xfs_bmap_btalloc() and then fall back to XFS_ALLOCTYPE_START_BNO
> > > with the same fsbno target.
> > > 
> > 
> > Ok, that's the retry where we restore minlen from maxlen back to 1. Note
> > that this still depends on nullfb, but that's just context. I think the
> > majority of the time data allocations come into the allocator as the
> > first allocation in the transaction.
> > 
> > That aside, I've also seen that this particular retry is hotter than
> > you'd think just by glancing at the code due to the whole minlen =
> > maxlen thing in xfs_bmap_select_minlen() combined with potential size of
> > delalloc extents. See the patches I posted a couple weeks ago in this
> > area for example:
> > 
> >   https://marc.info/?l=linux-xfs&m=155671950608062&w=2
> > 
> > ... though I don't think they relate to the issue you're tracking here.
> > 
> > > > We start with a near mode
> > > > allocation in the target AG. If that fails, xfs_alloc_vextent() switches
> > > > over to XFS_ALLOCTYPE_THIS_AG for the remaining AGs. This lands in
> > > > xfs_alloc_ag_vextent_size() which ignores args.agbno even if it's set.
> > > 
> > > Hmmm. I think you just pointed out a bug in this code. The initial
> > > NEARBNO alloc sets ->min_agbno and ->max_agbno based on the
> > > args->agbno, which xfs_alloc_compute_aligned then uses to determine
> > > if the found free extent can be used. When we immediately fall back
> > > to XFS_ALLOCTYPE_THIS_AG allocation, we don't reset
> > > min_agbno/max_agbno, and so even though we ignore the args->agbno
> > > target that is set when selecting a free extent, we still call
> > > xfs_alloc_compute_aligned() to determine if it is appropriate to
> > > use.
> > > 
> > 
> > I don't see where the near mode alloc assigns ->[min|max]_agbno. The
> > only place I see ->min_agbno used is for sparse inode chunk allocation.
> > ->max_agbno is assigned in the near mode alloc when min == 0, but it's
> > set to agsize so I'm guessing that this is just an initialization so the
> > checking logic works with zeroed out values from the caller.
> 
> I did say:
> 
> > > Of course, I could be missing something obvious (still haven't
> > > finished my morning coffee!), so it's worth checking I read the code
> > > properly....
> 
> And that's clearly the case here - I read the code that clamps agbno
> to min/max backwards....
> 

Ok..

> > Just to clarify.. what exactly is the reverse allocation problem you're
> > thinking about here? Logically increasing writes resulting in physically
> > decreasing allocations? If so, have we confirmed that writes are indeed
> > sequential when this occurs (obvious I guess, but worth asking).
> 
> Precisely this. And, yes, I've confirmed many times that it is
> sequential writes that display it. It happens when the nearest free
> extent is below the current target, i.e.:
> 
> 
> +FFFFFFFFFFFFF+aaaaaaaT+bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb+FFFFF+
> 
> We just allocated "a" so the next allocation target is T, but "b" is
> already allocated there. The freespace before "a" is larger than
> the free space after "b" (which may be less than maxlen), and so the
> "nearest best free space" is below the target. If "b" is large
> enough, we end up with file allocation that looks like (where n =
> relative file offset for the given allocation)
> 
>      +10+9+8+7+6+5+4+1+2+3+
> 
> i.e. allocation starts forward, uses the remaining ascending free
> space in the extent, then starts allocating backwards as it takes
> the nearest free space and trims off the top of the free extent for
> the allocation.
> 

Hmm, so are you saying that the allocation is not only physically
reversed, but the (physically out of order) ascending logical extents in
the file are actually physically contiguous as well? IOW, if freed, they
would form a single/contiguous free extent?

If so, any idea how common that pattern is to the instances of this
problem you've seen? I.e., is it always the case, sometimes the case..?

Also, any intuition on the size of these extents relative to the file?

> > I'm not really familiar with this problem so I've not thought about it,
> > but one random thought comes to mind: is there any chance of allocation
> > interlocking behind extent frees contributing to this behavior? We do
> > truncate files in reverse order an extent at a time and immediately
> > finish the dfops (to free the extent and busy it) before we unmap the
> > next piece.
> 
> Unlikely, because such extents are not immediately available for
> allocation (they are busy). And I've seen it on pure contended
> sequential write workloads, too...
> 

I was thinking that unbusying of extents could be similarly interlocked
across log buffer completions, but your latter point and the additional
context suggest this is more likely influenced by a combination of
allocation timing (i.e., AGF locking to determine when we jump AGs) and
free space state than anything like this.

> > If allocations could interlock with such frees (or busy
> > clearing), I could see in theory how that might result in backwards
> > allocations, but I'm just handwaving and not sure that's possible in
> > practice.
> 
> Most of the cases I've seen have had the same symptom - "skip to
> next AG, allocate at same high-up-in AGBNO target as the previous AG
> wanted, then allocate backwards in the same AG until freespace
> extent is exhausted. It then skips to some other freespace extent,
> and depending on whether it's a forwards or backwards skip the
> problem either goes away or continues". This is not a new behaviour,
> I first saw it some 15 years ago, but I've never been able to
> provoke it reliably enough with test code to get to the root
> cause...
> 

I guess the biggest question to me is whether we're looking for a
backwards searching pattern or a pattern where we split up a larger free
extent into smaller chunks (in reverse), or both. I can definitely see
some isolated areas where a backwards search could lead to this
behavior. E.g., my previous experiment to replace near mode allocs with
size mode allocs always allocates in reverse when free space is
sufficiently fragmented. To see this in practice would require repeated
size mode allocations, however, which I think is unlikely because once
we jump AGs and do a size mode alloc, the subsequent allocs should be
near mode within the new AG (unless we jump again and again, which I
don't think is consistent with what you're describing).

Hmm, the next opportunity for this kind of behavior in the near mode
allocator is probably the bnobt left/right span. This would require the
right circumstances to hit. We'd have to bypass the first (cntbt)
algorithm then find a closer extent in the left mode search vs. the
right mode search, and then probably repeat that across however many
allocations it takes to work out of this state.
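
Grossly simplified, the left/right decision reduces to something like
this (illustrative sketch only, not the actual
xfs_alloc_ag_vextent_near() logic):

/*
 * Span outward from the target in the bnobt and take whichever
 * candidate is nearer (assumes left_bno <= target <= right_bno).
 * If free space below the target repeatedly wins, successive
 * allocations march backwards through the AG.
 */
static unsigned long
pick_nearer(unsigned long target, unsigned long left_bno,
            unsigned long right_bno)
{
        unsigned long   ldiff = target - left_bno;
        unsigned long   rdiff = right_bno - target;

        return ldiff <= rdiff ? left_bno : right_bno;
}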

If instead we're badly breaking up an extent in the wrong order, it
looks like we do have the capability to allocate the right portion of an
extent (in xfs_alloc_compute_diff()), but that is only enabled for
non-data allocations. xfs_alloc_compute_aligned() can cause a similar effect
if alignment is set, but I'm not sure that would break up an extent into
more than one usable chunk.
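
To illustrate the "right portion" case, here's a trivial userspace
sketch (made-up numbers, not the kernel code) of how repeatedly
trimming an allocation off the top of a single free extent produces
physically descending start blocks:

#include <stdio.h>

int
main(void)
{
        unsigned long   fbno = 1000;    /* free extent start (hypothetical) */
        unsigned long   flen = 40;      /* free extent length */
        unsigned long   want = 10;      /* allocation size */

        while (flen >= want) {
                /* take the right portion of the remaining free extent */
                unsigned long   abno = fbno + flen - want;

                printf("alloc %lu-%lu\n", abno, abno + want - 1);
                flen -= want;   /* free extent shrinks from the top */
        }
        return 0;
}

This allocates 1030-1039, 1020-1029, 1010-1019, 1000-1009 in that
order, i.e. logically ascending extents that are physically reversed
and would merge back into one free extent if freed.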

In any event, maybe I'll hack some temporary code in the xfs_db locality
stuff to quickly flag whether I happen to get lucky enough to reproduce
any instances of this during the associated test workloads (and if so,
try and collect more data).
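
The flagging itself could be as simple as the following (hypothetical
sketch using a stand-in record type rather than the actual
xfs_bmbt_irec) while walking a file's extents in logical order:

struct ext {
        unsigned long long      startblock;
        unsigned long long      blockcount;
};

/*
 * Flag logically ascending extents that are physically contiguous in
 * reverse order, i.e. that would merge back into a single free extent
 * if both were freed.
 */
static int
is_reverse_contig(const struct ext *prev, const struct ext *next)
{
        return next->startblock + next->blockcount == prev->startblock;
}

This only catches the fully contiguous variant of the pattern, but a
hit on a freshly populated fs would at least tell us we've reproduced
something worth digging into.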

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster June 6, 2019, 3:21 p.m. UTC | #7
On Fri, May 31, 2019 at 01:11:36PM -0400, Brian Foster wrote:
> On Sun, May 26, 2019 at 08:43:17AM +1000, Dave Chinner wrote:
> > On Fri, May 24, 2019 at 08:00:18AM -0400, Brian Foster wrote:
> > > On Fri, May 24, 2019 at 08:15:52AM +1000, Dave Chinner wrote:
> > > > On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote:
> > > > > Hmmm.. I suppose if I had a script that
> > > > > just dumped every applicable stride/delta value for an inode, I could
> > > > > dump all of those numbers into a file and we can process it from there..
> > > > 
> > > > See how the freesp commands work in xfs_db - they just generate a
> > > > set of {offset, size} tuples that are then bucketed appropriately.
> > > > This is probably the best way to do this at the moment - have xfs_db
> > > > walk the inode BMBTs outputting something like {extent size,
> > > > distance to next extent} tuples and everything else falls out from
> > > > how we bucket that information.
> > > > 
> > > 
> > > That sounds plausible. A bit more involved than what I'm currently
> > > working with, but we do already have a blueprint for the scanning
> > > implementation required to collect this data via the frag command.
> > > Perhaps some of this code between the frag/freesp can be generalized and
> > > reused. I'll take a closer look at it.
> > > 
> > > My only concern is I'd prefer to only go down this path as long as we
> > > plan to land the associated command in xfs_db. So this approach suggests
> > > to me that we add a "locality" command similar to frag/freesp that
> > > presents the locality state of the fs. For now I'm only really concerned
> > > with the data associated with known near mode allocations (i.e., such as
> > > the extent size:distance relationship you've outlined above) so we can
> > > evaluate these algorithmic changes, but this would be for fs devs only
> > > so we could always expand on it down the road if we want to assess
> > > different allocations. Hm?
> > 
> > Yup, I'm needing to do similar analysis myself to determine how
> > quickly I'm aging the filesystem, so having the tool in xfs_db or
> > xfs_spaceman would be very useful.
> > 
> > FWIW, the tool I've just started writing will just use fallocate and
> > truncate to hammer the allocation code as hard and as quickly as
> > possible - I want to do accelerated aging of the filesystem, and so
> > being able to run tens to hundreds of thousands of free space
> > manipulations a second is the goal here....
> > 
> 
> Ok. FWIW, from playing with this so far (before getting distracted for
> much of this week) the most straightforward place to add this kind of
> functionality turns out to be the frag command itself. It does 99% of
> the work required to process data extents already, including pulling the
> on-disk records of each inode in-core for processing. I basically just
> had to update that code to include all of the record data and add the
> locality tracking logic (I haven't got to actually presenting it yet..).
> 

I managed to collect some preliminary data based on this strategy. I
regenerated the associated dataset as I wanted to introduce more near
mode allocation events where locality is relevant/measurable. The set is
still generated via filebench, but the workload now runs 8x file
creators in parallel with 16x random size file appenders (with 1% of the
dataset being preallocated to support the appends, and thus without
contention). In total, it creates 10k files that amount to ~5.8TB of
space. The filesystem geometry is as follows:

meta-data=/dev/mapper/fedora_hpe--apollo--cn99xx--19-tmp isize=512    agcount=8, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=1941012480, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

The locality data is collected via a hacked up variant of 'xfs_db -c
frag ...' that reuses the 'freesp' command histogram to display locality
deltas instead of extent lengths. Each bucket shows how many
extents/allocs fall into that particular delta range and the average
length of the associated extents. Note that the percent column is based
on extent count vs the total (not block or delta counts). Finally, note
that the final row uses a dummy value of agsize to account for AG jumps. As
such, the associated delta values are invalid, but the extent count/size
values are valid. This data is included to provide some insight on how
often we fall back to external AGs (from the locality target) due to
contention.
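
For clarity, the bucketing reduces to roughly the following (a sketch
of the hacked-up logic, not the actual patch; the special-cased agsize
dummy row is elided here):

#include <stdint.h>

#define NBUCKETS        33

struct bucket {
        uint64_t        extents;        /* extents in this delta range */
        uint64_t        delta;          /* summed locality deltas */
        uint64_t        blocks;         /* summed extent lengths */
};

static struct bucket    buckets[NBUCKETS];
static uint64_t         total_extents;

/* bucket 0 is [0,0], bucket i covers [2^(i-1), 2^i - 1] */
static int
delta_to_bucket(uint64_t delta)
{
        int     i = 0;

        while (delta) {
                delta >>= 1;
                i++;
        }
        return i < NBUCKETS ? i : NBUCKETS - 1;
}

/* dist is the delta from the locality target (or agsize on AG jumps) */
static void
bucket_alloc(uint64_t dist, uint64_t len)
{
        struct bucket   *b = &buckets[delta_to_bucket(dist)];

        b->extents++;
        b->delta += dist;
        b->blocks += len;
        total_extents++;
}

Per row, avgsz is then blocks/extents and pct is extents as a
percentage of total_extents.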

Bear in mind that so far I've only run this workload once on each kernel
and I believe there is opportunity for variance run-to-run. Current data
and observations follow:

- Baseline kernel (5.2.0-rc1):

Files on this filesystem average 59.81 extents per file
      from         to    extents           delta      avgsz    pct
         0          0          1               0       4879   0.00
         1          1         18              18        633   0.00
         2          3         76             193        630   0.01
         4          7        117             644      10801   0.02
         8         15        246            2735        701   0.04
        16         31        411            9769        873   0.07
        32         63        858           40614        877   0.14
        64        127       1931          183122        872   0.32
       128        255       3658          693079       1423   0.60
       256        511       4393         1619094        767   0.73
       512       1023       6049         4582491        666   1.00
      1024       2047      19564        34684608        810   3.23
      2048       4095      21391        62471744        828   3.53
      4096       8191      24735       140424459        920   4.08
      8192      16383      18383       213465918       1030   3.03
     16384      32767      14447       336272956       1195   2.38
     32768      65535      12359       580683797       1154   2.04
     65536     131071      11943      1138606730       1446   1.97
    131072     262143      16825      3279118006       1701   2.78
    262144     524287      32133     12725299194       1905   5.30
    524288    1048575      58899     45775066024       1845   9.72
   1048576    2097151      95302    147197567651       2020  15.73
   2097152    4194303      86659    252037848233       2794  14.30
   4194304    8388607      67067    397513880288       2876  11.07
   8388608   16777215      47940    583161227489       2319   7.91
  16777216   33554431      24878    537577890034       3321   4.11
  33554432   67108863       5889    269065981311      16106   0.97
  67108864  134217727       3022    304867554478      33429   0.50
 134217728  268435454       2994    539180795612      35744   0.49
 268435455  268435455      23709   6364336202595       4840   3.91

- Control (base w/ unconditional SIZE mode allocs):

Files on this filesystem average 58.60 extents per file
      from         to    extents           delta      avgsz    pct
         0          0          1               0        180   0.00
         4          7         19             115      11379   0.00
         8         15         21             231         15   0.00
        16         31          3              58         45   0.00
        64        127          3             288       7124   0.00
       128        255          4             780      60296   0.00
       256        511          3             978      51563   0.00
       512       1023          9            7072     105727   0.00
      1024       2047         33           50773       4765   0.01
      2048       4095         98          306258      15689   0.02
      4096       8191        258         1513775       1981   0.04
      8192      16383        458         5633481       2537   0.08
     16384      32767        934        23078682       3013   0.16
     32768      65535       1783        87851701       3109   0.30
     65536     131071       3382       332685185       1810   0.57
    131072     262143       8904      1784659842       2275   1.50
    262144     524287      23878      9433551033       1903   4.02
    524288    1048575      54422     42483032893       1894   9.17
   1048576    2097151      97884    148883431239       2048  16.49
   2097152    4194303      81999    236737184381       2741  13.81
   4194304    8388607      86826    510450130696       2639  14.63
   8388608   16777215      54870    652250378434       2101   9.24
  16777216   33554431      40408    985568011752       1959   6.81
  33554432   67108863      46354   2258464180697       2538   7.81
  67108864  134217727      59461   5705095317989       3380  10.02
 134217728  268435454      16205   2676447855794       4891   2.73
 268435455  268435455      15423   4140080022465       5243   2.60

- Test (base + this series):

Files on this filesystem average 59.76 extents per file
      from         to    extents           delta      avgsz    pct
         0          0          2               0        419   0.00
         1          1        258             258        387   0.04
         2          3         81             201        598   0.01
         4          7        139             790      13824   0.02
         8         15        257            2795        710   0.04
        16         31        417            9790        852   0.07
        32         63        643           30714        901   0.11
        64        127       1158          110148        835   0.19
       128        255       1947          370953        822   0.32
       256        511       3567         1348313        744   0.59
       512       1023       5151         3921794        695   0.85
      1024       2047      22895        39640251        924   3.78
      2048       4095      34375       100757727        922   5.68
      4096       8191      30381       171773183        914   5.02
      8192      16383      18977       214896159       1091   3.13
     16384      32767       8460       192726268       1212   1.40
     32768      65535       6071       286544623       1611   1.00
     65536     131071       7803       757765738       1680   1.29
    131072     262143      15300      3001300980       1877   2.53
    262144     524287      27218     10868169139       1993   4.50
    524288    1048575      60423     47321126020       1948   9.98
   1048576    2097151     100683    158191884842       2174  16.63
   2097152    4194303      92642    274106200889       2508  15.30
   4194304    8388607      73987    436219940202       2421  12.22
   8388608   16777215      49636    591854981117       2434   8.20
  16777216   33554431      15716    353157130237       4772   2.60
  33554432   67108863       4948    228004142686      19221   0.82
  67108864  134217727       2381    231811967738      35781   0.39
 134217728  268435454       2140    385403697868      29221   0.35
 268435455  268435455      17746   4763655584430       7603   2.93

Firstly, comparison of the baseline and control data shows that near
mode allocation is effective at improving allocation locality compared
to size mode. In both cases, the 1048576-4194304 buckets hold the
largest share of extents. If we look at the sub-1048576 data, however,
~40% of allocations land in this range on the baseline kernel vs. only
~16% for the control. Another interesting data point is the noticeable
drop in AG skips from the baseline kernel (~24k) to the control (~15k).
I suspect the baseline's higher skip count is due to the additional
overhead of locality based allocation causing more contention.

Comparison of the baseline and test data shows a generally similar
breakdown between the two. The sub-1048576 range populates the same
buckets and makes up ~41% of the total extents. The per-bucket numbers
differ, but all of the buckets are within a percentage point or so. One
previously unknown advantage of the test algorithm shown by this data is
that the number of AG skips drops down to ~18k, which almost splits the
difference between the baseline and control (and slightly in favor of
the latter). I suspect that is related to the simpler and more bounded
near mode algorithm as compared to the current implementation.

Thoughts on any of this data or presentation? I could dig further into
details or alternatively base the histogram on something like extent
size and show the average delta for each extent size bucket, but I'm not
sure that will tell us anything profound with respect to this patchset.
One thing I noticed while processing this data is that the current
dataset skews heavily towards smaller allocations. I still think it's a
useful comparison because smaller allocations are more likely to stress
either algorithm via a larger locality search space, but I may try to
repeat this test with a workload with fewer files and larger allocations
and see how that changes things.

Brian
Dave Chinner June 6, 2019, 10:05 p.m. UTC | #8
On Fri, May 31, 2019 at 01:11:36PM -0400, Brian Foster wrote:
> On Sun, May 26, 2019 at 08:43:17AM +1000, Dave Chinner wrote:
> > Most of the cases I've seen have had the same symptom - "skip to
> > next AG, allocate at same high-up-in AGBNO target as the previous AG
> > wanted, then allocate backwards in the same AG until freespace
> > extent is exhausted. It then skips to some other freespace extent,
> > and depending on whether it's a forwards or backwards skip the
> > problem either goes away or continues. This is not a new behaviour,
> > I first saw it some 15 years ago, but I've never been able to
> > provoke it reliably enough with test code to get to the root
> > cause...
> > 
> 
> I guess the biggest question to me is whether we're more looking for a
> backwards searching pattern or a pattern where we split up a larger free
> extent into smaller chunks (in reverse), or both. I can definitely see
> some isolated areas where a backwards search could lead to this
> behavior. E.g., my previous experiment to replace near mode allocs with
> size mode allocs always allocates in reverse when free space is
> sufficiently fragmented. To see this in practice would require repeated
> size mode allocations, however, which I think is unlikely because once
> we jump AGs and do a size mode alloc, the subsequent allocs should be
> near mode within the new AG (unless we jump again and again, which I
> don't think is consistent with what you're describing).
> 
> Hmm, the next opportunity for this kind of behavior in the near mode
> allocator is probably the bnobt left/right span. This would require the
> right circumstances to hit. We'd have to bypass the first (cntbt)
> algorithm then find a closer extent in the left mode search vs. the
> right mode search, and then probably repeat that across however many
> allocations it takes to work out of this state.
> 
> If instead we're badly breaking up an extent in the wrong order, it
> looks like we do have the capability to allocate the right portion of an
> > extent (in xfs_alloc_compute_diff()), but that is only enabled for
> > non-data allocations. xfs_alloc_compute_aligned() can cause a similar effect
> if alignment is set, but I'm not sure that would break up an extent into
> more than one usable chunk.

This pretty much matches what I've been able to infer about the
cause, but lacking a way to actually trigger it and be able to
monitor the behaviour in real time is where I've got stuck on this.
I see the result in aged, fragmented filesystems and can infer how
it may have occurred, but can't cause it to occur on demand...

> In any event, maybe I'll hack some temporary code in the xfs_db locality
> stuff to quickly flag whether I happen to get lucky enough to reproduce
> any instances of this during the associated test workloads (and if so,
> try and collect more data).

*nod*

Best we can do, I think, and hope we stumble across an easily
reproducible trigger...

Cheers,

Dave.
Dave Chinner June 6, 2019, 10:13 p.m. UTC | #9
On Thu, Jun 06, 2019 at 11:21:04AM -0400, Brian Foster wrote:
> On Fri, May 31, 2019 at 01:11:36PM -0400, Brian Foster wrote:
> > On Sun, May 26, 2019 at 08:43:17AM +1000, Dave Chinner wrote:
> > > On Fri, May 24, 2019 at 08:00:18AM -0400, Brian Foster wrote:
> > > > On Fri, May 24, 2019 at 08:15:52AM +1000, Dave Chinner wrote:
> > > > > On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote:
> > > > > > Hmmm.. I suppose if I had a script that
> > > > > > just dumped every applicable stride/delta value for an inode, I could
> > > > > > dump all of those numbers into a file and we can process it from there..
> > > > > 
> > > > > See how the freesp commands work in xfs_db - they just generate a
> > > > > set of {offset, size} tuples that are then bucketed appropriately.
> > > > > This is probably the best way to do this at the moment - have xfs_db
> > > > > walk the inode BMBTs outputting something like {extent size,
> > > > > distance to next extent} tuples and everything else falls out from
> > > > > how we bucket that information.
> > > > > 
> > > > 
> > > > That sounds plausible. A bit more involved than what I'm currently
> > > > working with, but we do already have a blueprint for the scanning
> > > > implementation required to collect this data via the frag command.
> > > > Perhaps some of this code between the frag/freesp can be generalized and
> > > > reused. I'll take a closer look at it.
> > > > 
> > > > My only concern is I'd prefer to only go down this path as long as we
> > > > plan to land the associated command in xfs_db. So this approach suggests
> > > > to me that we add a "locality" command similar to frag/freesp that
> > > > presents the locality state of the fs. For now I'm only really concerned
> > > > with the data associated with known near mode allocations (i.e., such as
> > > > the extent size:distance relationship you've outlined above) so we can
> > > > evaluate these algorithmic changes, but this would be for fs devs only
> > > > so we could always expand on it down the road if we want to assess
> > > > different allocations. Hm?
> > > 
> > > Yup, I'm needing to do similar analysis myself to determine how
> > > quickly I'm aging the filesystem, so having the tool in xfs_db or
> > > xfs_spaceman would be very useful.
> > > 
> > > FWIW, the tool I've just started writing will just use fallocate and
> > > truncate to hammer the allocation code as hard and as quickly as
> > > possible - I want to do accelerated aging of the filesystem, and so
> > > being able to run tens to hundreds of thousands of free space
> > > manipulations a second is the goal here....
> > > 
> > 
> > Ok. FWIW, from playing with this so far (before getting distracted for
> > much of this week) the most straightforward place to add this kind of
> > functionality turns out to be the frag command itself. It does 99% of
> > the work required to process data extents already, including pulling the
> > on-disk records of each inode in-core for processing. I basically just
> > had to update that code to include all of the record data and add the
> > locality tracking logic (I haven't got to actually presenting it yet..).
> > 
> 
> I managed to collect some preliminary data based on this strategy. I
....
> 
> Comparison of the baseline and test data shows a generally similar
> breakdown between the two.

Which is the result I wanted to see :)

> Thoughts on any of this data or presentation?

I think it's useful for comparing whether an allocator change has
affected the overall locality of allocation. If it's working as we
expect, you should get vastly different results for inode32 vs
inode64 mount options, with inode32 showing much, much higher
distances for most allocations, so it might be worth running a quick
test to confirm that it does, indeed, demonstrate the results we'd
expect from such a change.

> I could dig further into
> details or alternatively base the histogram on something like extent
> size and show the average delta for each extent size bucket, but I'm not
> sure that will tell us anything profound with respect to this patchset.

*nod*

> One thing I noticed while processing this data is that the current
> dataset skews heavily towards smaller allocations. I still think it's a
> useful comparison because smaller allocations are more likely to stress
> either algorithm via a larger locality search space, but I may try to
> repeat this test with a workload with fewer files and larger allocations
> and see how that changes things.

From the testing I've been doing, I think the file count of around
10k isn't sufficient to really cause severe allocation issues.
Directory and inode metadata are great for fragmenting free space,
so dramatically increasing the number of smaller files might actually
produce worse behaviour....

Cheers,

Dave.
Brian Foster June 7, 2019, 12:56 p.m. UTC | #10
On Fri, Jun 07, 2019 at 08:05:59AM +1000, Dave Chinner wrote:
> On Fri, May 31, 2019 at 01:11:36PM -0400, Brian Foster wrote:
> > On Sun, May 26, 2019 at 08:43:17AM +1000, Dave Chinner wrote:
> > > Most of the cases I've seen have had the same symptom - "skip to
> > > next AG, allocate at same high-up-in AGBNO target as the previous AG
> > > wanted, then allocate backwards in the same AG until freespace
> > > extent is exhausted. It then skips to some other freespace extent,
> > > and depending on whether it's a forwards or backwards skip the
> > > problem either goes away or continues. This is not a new behaviour,
> > > I first saw it some 15 years ago, but I've never been able to
> > > provoke it reliably enough with test code to get to the root
> > > cause...
> > > 
> > 
> > I guess the biggest question to me is whether we're more looking for a
> > backwards searching pattern or a pattern where we split up a larger free
> > extent into smaller chunks (in reverse), or both. I can definitely see
> > some isolated areas where a backwards search could lead to this
> > behavior. E.g., my previous experiment to replace near mode allocs with
> > size mode allocs always allocates in reverse when free space is
> > sufficiently fragmented. To see this in practice would require repeated
> > size mode allocations, however, which I think is unlikely because once
> > we jump AGs and do a size mode alloc, the subsequent allocs should be
> > near mode within the new AG (unless we jump again and again, which I
> > don't think is consistent with what you're describing).
> > 
> > Hmm, the next opportunity for this kind of behavior in the near mode
> > allocator is probably the bnobt left/right span. This would require the
> > right circumstances to hit. We'd have to bypass the first (cntbt)
> > algorithm then find a closer extent in the left mode search vs. the
> > right mode search, and then probably repeat that across however many
> > allocations it takes to work out of this state.
> > 
> > If instead we're badly breaking up an extent in the wrong order, it
> > looks like we do have the capability to allocate the right portion of an
> > > extent (in xfs_alloc_compute_diff()), but that is only enabled for
> > > non-data allocations. xfs_alloc_compute_aligned() can cause a similar effect
> > if alignment is set, but I'm not sure that would break up an extent into
> > more than one usable chunk.
> 
> This is pretty much matches what I've been able to infer about the
> cause, but lacking a way to actually trigger it and be able to
> monitor the behviour in real time is where I've got stuck on this.
> I see the result in aged, fragmented filesystems and can infer how
> it may have occurred, but can't cause it to occur on demand...
> 

Ok.

> > In any event, maybe I'll hack some temporary code in the xfs_db locality
> > stuff to quickly flag whether I happen to get lucky enough to reproduce
> > any instances of this during the associated test workloads (and if so,
> > try and collect more data).
> 
> *nod*
> 
> Best we can do, I think, and hope we stumble across an easily
> reproducable trigger...
> 

Unfortunately I haven't seen any instances of this in the workloads I
ran to generate the most recent datasets. I have a couple more
experiments to try though so I'll keep an eye out.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster June 7, 2019, 12:57 p.m. UTC | #11
On Fri, Jun 07, 2019 at 08:13:01AM +1000, Dave Chinner wrote:
> On Thu, Jun 06, 2019 at 11:21:04AM -0400, Brian Foster wrote:
> > On Fri, May 31, 2019 at 01:11:36PM -0400, Brian Foster wrote:
> > > On Sun, May 26, 2019 at 08:43:17AM +1000, Dave Chinner wrote:
> > > > On Fri, May 24, 2019 at 08:00:18AM -0400, Brian Foster wrote:
> > > > > On Fri, May 24, 2019 at 08:15:52AM +1000, Dave Chinner wrote:
> > > > > > On Thu, May 23, 2019 at 08:55:35AM -0400, Brian Foster wrote:
> > > > > > > Hmmm.. I suppose if I had a script that
> > > > > > > just dumped every applicable stride/delta value for an inode, I could
> > > > > > > dump all of those numbers into a file and we can process it from there..
> > > > > > 
> > > > > > See how the freesp commands work in xfs_db - they just generate a
> > > > > > set of {offset, size} tuples that are then bucketed appropriately.
> > > > > > This is probably the best way to do this at the moment - have xfs_db
> > > > > > walk the inode BMBTs outputting something like {extent size,
> > > > > > distance to next extent} tuples and everything else falls out from
> > > > > > how we bucket that information.
> > > > > > 
> > > > > 
> > > > > That sounds plausible. A bit more involved than what I'm currently
> > > > > working with, but we do already have a blueprint for the scanning
> > > > > implementation required to collect this data via the frag command.
> > > > > Perhaps some of this code between the frag/freesp can be generalized and
> > > > > reused. I'll take a closer look at it.
> > > > > 
> > > > > My only concern is I'd prefer to only go down this path as long as we
> > > > > plan to land the associated command in xfs_db. So this approach suggests
> > > > > to me that we add a "locality" command similar to frag/freesp that
> > > > > presents the locality state of the fs. For now I'm only really concerned
> > > > > with the data associated with known near mode allocations (i.e., such as
> > > > > the extent size:distance relationship you've outlined above) so we can
> > > > > evaluate these algorithmic changes, but this would be for fs devs only
> > > > > so we could always expand on it down the road if we want to assess
> > > > > different allocations. Hm?
> > > > 
> > > > Yup, I'm needing to do similar analysis myself to determine how
> > > > quickly I'm aging the filesystem, so having the tool in xfs_db or
> > > > xfs_spaceman would be very useful.
> > > > 
> > > > FWIW, the tool I've just started writing will just use fallocate and
> > > > truncate to hammer the allocation code as hard and as quickly as
> > > > possible - I want to do accelerated aging of the filesystem, and so
> > > > being able to run tens to hundreds of thousands of free space
> > > > manipulations a second is the goal here....
> > > > 
> > > 
> > > Ok. FWIW, from playing with this so far (before getting distracted for
> > > much of this week) the most straightforward place to add this kind of
> > > functionality turns out to be the frag command itself. It does 99% of
> > > the work required to process data extents already, including pulling the
> > > on-disk records of each inode in-core for processing. I basically just
> > > had to update that code to include all of the record data and add the
> > > locality tracking logic (I haven't got to actually presenting it yet..).
> > > 
> > 
> > I managed to collect some preliminary data based on this strategy. I
> ....
> > 
> > Comparison of the baseline and test data shows a generally similar
> > breakdown between the two.
> 
> Which is the result I wanted to see :)
> 

Indeed. ;)

> > Thoughts on any of this data or presentation?
> 
> I think it's useful for comparing whether an allocator change has
> affected the overall locality of allocation. If it's working as we
> expect, you should get vastly different results for inode32 vs
> inode64 mount options, with inode32 showing much, much higher
> distances for most allocations, so it might be worth running a quick
> test to confirm that it does, indeed, demonstrate the results we'd
> expect from such a change.
> 

Ok, I can add inode32 into the mix as a sanity check. I'm guessing this
will result in an increase in AG skips. I also think that once we have
that first data allocation, the existing heuristics may apply similarly
since we're only concerned about contiguity with the previous extent at
that point. Perhaps I'll reserve this for a test with an even higher
inode count so there are more first data extent allocations (where the
inode is the locality target, but sits in a metadata preferred AG) to
measure.

> > I could dig further into
> > details or alternatively base the histogram on something like extent
> > size and show the average delta for each extent size bucket, but I'm not
> > sure that will tell us anything profound with respect to this patchset.
> 
> *nod*
> 
> > One thing I noticed while processing this data is that the current
> > dataset skews heavily towards smaller allocations. I still think it's a
> > useful comparison because smaller allocations are more likely to stress
> > either algorithm via a larger locality search space, but I may try to
> > repeat this test with a workload with fewer files and larger allocations
> > and see how that changes things.
> 
> From the testing I've been doing, I think the file count of around
> 10k isn't sufficient to really cause severe allocation issues.
> Directory and inode metadata are great for fragmenting free space,
> so dramatically increasing the number of smaller files might actually
> produce worse behaviour....
> 

I'd still like to see the results for larger allocations, if for
nothing else than the broader coverage I was hoping for from the most
recent test. Once I get through that, I'll move back in the smaller
file direction and
increase the file count as well as the span of the directory tree.

Hmm, random file sizes between 4k and say 96k or so should allow me to
at least create tens of millions of files/dirs (I'd like to get above
64k for some of these allocations to cover post-eof prealloc). That
should stress the allocator as noted with the additional inode/dir
metadata allocations as well as facilitate inode32 experiments.
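
FWIW, the creation side of that could be as dumb as the following
(hypothetical sketch, not the actual filebench profile; the idea would
be to run many instances in parallel across a directory tree):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        long    nfiles = argc > 1 ? atol(argv[1]) : 100000;
        char    buf[4096] = { 0 };
        char    path[64];
        long    i;

        srand(time(NULL));
        for (i = 0; i < nfiles; i++) {
                /* 1-24 4k blocks, i.e. random sizes from 4k to 96k */
                int     blocks = 1 + rand() % 24;
                int     fd, j;

                snprintf(path, sizeof(path), "file.%ld", i);
                fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                if (fd < 0) {
                        perror(path);
                        return 1;
                }
                for (j = 0; j < blocks; j++)
                        if (write(fd, buf, sizeof(buf)) < 0) {
                                perror("write");
                                return 1;
                        }
                close(fd);
        }
        return 0;
}

Sizes of 17 blocks and up land above 64k, so a fair share of the files
should cross the post-eof prealloc point noted above.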

Thanks for the feedback..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Darrick J. Wong June 21, 2019, 3:18 p.m. UTC | #12
On Wed, May 22, 2019 at 02:05:35PM -0400, Brian Foster wrote:
> Hi all,
> 
> This is v2 of the extent allocation rework series. The changes in this
> version are mostly associated with code factoring, based on feedback to
> v1. The small mode helper refactoring has been isolated and pulled to
> the start of the series. The active flag that necessitated the btree
> cursor container structure has been pushed into the xfs_btree_cur
> private area. The resulting high level allocation code in
> xfs_ag_alloc_vextent() has been cleaned up to remove an unnecessary
> level of abstraction. Finally, there are various minor cleanups and
> fixes.

Hmmm.  Just for fun I decided to apply this series to see what would
happen, and on a 1k block filesystem shook out a test failure that seems
like it could be related?

MKFS_OPTIONS='-f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdf'
MOUNT_OPTIONS='-o usrquota,grpquota,prjquota /dev/sdf /opt'

--D

--- generic/223.out
+++ generic/223.out.bad
@@ -48,7 +48,7 @@
 === Testing size 1g falloc on 8k stripe ===
 SCRATCH_MNT/file-1g-falloc: well-aligned
 === Testing size 1073745920 falloc on 8k stripe ===
-SCRATCH_MNT/file-1073745920-falloc: well-aligned
+SCRATCH_MNT/file-1073745920-falloc: Start block 197673 not multiple of sunit 2
 === mkfs with su 4 blocks x 4 ===
 === Testing size 1*16k on 16k stripe ===
 SCRATCH_MNT/file-1-16384-falloc: well-aligned
@@ -98,7 +98,7 @@
 === Testing size 1g falloc on 16k stripe ===
 SCRATCH_MNT/file-1g-falloc: well-aligned
 === Testing size 1073745920 falloc on 16k stripe ===
-SCRATCH_MNT/file-1073745920-falloc: well-aligned
+SCRATCH_MNT/file-1073745920-falloc: Start block 197673 not multiple of sunit 4
 === mkfs with su 8 blocks x 4 ===
 === Testing size 1*32k on 32k stripe ===
 SCRATCH_MNT/file-1-32768-falloc: well-aligned
@@ -148,7 +148,7 @@
 === Testing size 1g falloc on 32k stripe ===
 SCRATCH_MNT/file-1g-falloc: well-aligned
 === Testing size 1073745920 falloc on 32k stripe ===
-SCRATCH_MNT/file-1073745920-falloc: well-aligned
+SCRATCH_MNT/file-1073745920-falloc: Start block 197673 not multiple of sunit 8
 === mkfs with su 16 blocks x 4 ===
 === Testing size 1*64k on 64k stripe ===
 SCRATCH_MNT/file-1-65536-falloc: well-aligned
@@ -198,7 +198,7 @@
 === Testing size 1g falloc on 64k stripe ===
 SCRATCH_MNT/file-1g-falloc: well-aligned
 === Testing size 1073745920 falloc on 64k stripe ===
-SCRATCH_MNT/file-1073745920-falloc: well-aligned
+SCRATCH_MNT/file-1073745920-falloc: Start block 197665 not multiple of sunit 16
 === mkfs with su 32 blocks x 4 ===
 === Testing size 1*128k on 128k stripe ===
 SCRATCH_MNT/file-1-131072-falloc: well-aligned
Brian Foster July 1, 2019, 7:12 p.m. UTC | #13
On Fri, Jun 21, 2019 at 08:18:35AM -0700, Darrick J. Wong wrote:
> On Wed, May 22, 2019 at 02:05:35PM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > This is v2 of the extent allocation rework series. The changes in this
> > version are mostly associated with code factoring, based on feedback to
> > v1. The small mode helper refactoring has been isolated and pulled to
> > the start of the series. The active flag that necessitated the btree
> > cursor container structure has been pushed into the xfs_btree_cur
> > private area. The resulting high level allocation code in
> > xfs_ag_alloc_vextent() has been cleaned up to remove an unnecessary
> > level of abstraction. Finally, there are various minor cleanups and
> > fixes.
> 
> Hmmm.  Just for fun I decided to apply this series to see what would
> happen, and on a 1k block filesystem shook out a test failure that seems
> like it could be related?
> 

I had reproduced this earlier on and eventually determined it was kind
of circumstantial with respect to this series. I had eliminated some of
the operations from generic/223 for quicker reproduction/RCA and
ultimately reproduced the same behavior on upstream (at the time, which
was probably a month or two ago by now) with a slightly different
workload. That led to the following series:

https://marc.info/?l=linux-xfs&m=155671950608062&w=2

... which IIRC addressed the problem in both scenarios. Thoughts on
those patches?

Brian

> MKFS_OPTIONS='-f -m reflink=1,rmapbt=1 -i sparse=1, -b size=1024, /dev/sdf'
> MOUNT_OPTIONS='-o usrquota,grpquota,prjquota /dev/sdf /opt'
> 
> --D
> 
> --- generic/223.out
> +++ generic/223.out.bad
> @@ -48,7 +48,7 @@
>  === Testing size 1g falloc on 8k stripe ===
>  SCRATCH_MNT/file-1g-falloc: well-aligned
>  === Testing size 1073745920 falloc on 8k stripe ===
> -SCRATCH_MNT/file-1073745920-falloc: well-aligned
> +SCRATCH_MNT/file-1073745920-falloc: Start block 197673 not multiple of sunit 2
>  === mkfs with su 4 blocks x 4 ===
>  === Testing size 1*16k on 16k stripe ===
>  SCRATCH_MNT/file-1-16384-falloc: well-aligned
> @@ -98,7 +98,7 @@
>  === Testing size 1g falloc on 16k stripe ===
>  SCRATCH_MNT/file-1g-falloc: well-aligned
>  === Testing size 1073745920 falloc on 16k stripe ===
> -SCRATCH_MNT/file-1073745920-falloc: well-aligned
> +SCRATCH_MNT/file-1073745920-falloc: Start block 197673 not multiple of sunit 4
>  === mkfs with su 8 blocks x 4 ===
>  === Testing size 1*32k on 32k stripe ===
>  SCRATCH_MNT/file-1-32768-falloc: well-aligned
> @@ -148,7 +148,7 @@
>  === Testing size 1g falloc on 32k stripe ===
>  SCRATCH_MNT/file-1g-falloc: well-aligned
>  === Testing size 1073745920 falloc on 32k stripe ===
> -SCRATCH_MNT/file-1073745920-falloc: well-aligned
> +SCRATCH_MNT/file-1073745920-falloc: Start block 197673 not multiple of sunit 8
>  === mkfs with su 16 blocks x 4 ===
>  === Testing size 1*64k on 64k stripe ===
>  SCRATCH_MNT/file-1-65536-falloc: well-aligned
> @@ -198,7 +198,7 @@
>  === Testing size 1g falloc on 64k stripe ===
>  SCRATCH_MNT/file-1g-falloc: well-aligned
>  === Testing size 1073745920 falloc on 64k stripe ===
> -SCRATCH_MNT/file-1073745920-falloc: well-aligned
> +SCRATCH_MNT/file-1073745920-falloc: Start block 197665 not multiple of sunit 16
>  === mkfs with su 32 blocks x 4 ===
>  === Testing size 1*128k on 128k stripe ===
>  SCRATCH_MNT/file-1-131072-falloc: well-aligned