Message ID | 20171127202434.43125-4-bfoster@redhat.com (mailing list archive)
---|---
State | New, archived
On Mon, Nov 27, 2017 at 03:24:33PM -0500, Brian Foster wrote: > We've had rare reports of transaction overruns in > xfs_inactive_ifree() for quite some time. Analysis of a reproducing > metadump has shown the problem is essentially caused by performing > too many agfl block frees in a single transaction. > > For example, an inode chunk is freed and the associated agfl fixup > algorithm discovers it needs to free a single agfl block before the > chunk free occurs. This single block ends up causing a space btree > join and adds one or more blocks back onto the agfl. This causes > xfs_alloc_fix_freelist() to free up to 3 blocks just to rectify a > single block discrepancy. > > The transaction overrun occurs under several other unfortunate > circumstances: > > - Each agfl block free is left+right contiguous. This requires 2 > record deletions and 1 insertion for the cntbt and thus requires > more log reservation than normal. > - The associated transaction is the first in the CIL ctx and thus > the ctx header reservation is consumed. > - The transaction reservation is larger than a log buffer and thus > extra split header reservation is consumed. > > As a result of the agfl and free space state of the filesystem, the > agfl fixup code has dirtied more cntbt buffer space than allowed by > the portion of the reservation allotted for block allocation. This > is all before the real allocation even starts! > > Note that the log related conditions above are correctly covered by > the existing transaction reservation. The above demonstrates that > the reservation allotted for the context/split headers may help > suppress overruns in the more common case where that reservation > goes unused for its intended purpose. > > To address this problem, update xfs_alloc_fix_freelist() to amortize > agfl block frees over multiple transactions. Free one block per > transaction so long as the agfl is less than half free. 
The agfl > minimum allocation requirement is dynamic, but is based on the > geometry of the associated btrees (i.e., level count) and therefore > should be easily rectified over multiple allocation transactions. > Further, there is no real harm in leaving extraneous blocks on the > agfl so long as there are enough free slots available for btree > blocks freed as a result of the upcoming allocation. > > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/xfs/libxfs/xfs_alloc.c | 15 ++++++++++----- > 1 file changed, 10 insertions(+), 5 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > index 0da80019a917..d8d58e35da00 100644 > --- a/fs/xfs/libxfs/xfs_alloc.c > +++ b/fs/xfs/libxfs/xfs_alloc.c > @@ -2117,11 +2117,6 @@ xfs_alloc_fix_freelist( > * appropriately based on the recursion count and dirty state of the > * buffer. > * > - * XXX (dgc): When we have lots of free space, does this buy us > - * anything other than extra overhead when we need to put more blocks > - * back on the free list? Maybe we should only do this when space is > - * getting low or the AGFL is more than half full? > - * > * The NOSHRINK flag prevents the AGFL from being shrunk if it's too > * big; the NORMAP flag prevents AGFL expand/shrink operations from > * updating the rmapbt. Both flags are used in xfs_repair while we're > @@ -2151,6 +2146,16 @@ xfs_alloc_fix_freelist( > goto out_agbp_relse; > } > xfs_trans_binval(tp, bp); > + > + /* > + * Freeing all extra agfl blocks adds too much log reservation > + * overhead to a single transaction, particularly considering > + * that freeing a block can cause a btree join and put one right > + * back on the agfl. Try to free one block per tx so long as > + * we've left enough free slots for the upcoming modifications. > + */ > + if (pag->pagf_flcount <= (XFS_AGFL_SIZE(mp) >> 1)) > + break; > } In /theory/, this /should/ work. 
However, as the comment you removed implies, there are likely to be issues with this as we get near ENOSPC. We know that if we don't trim the AGFL right down to the minimum requested as we approach ENOSPC we can get premature ENOSPC events being reported that lead to filesystem shutdowns. (e.g. need the very last free block for a BMBT block to complete a data extent allocation). Hence I'd suggest that this needs to be aware of the low space allocation algorithm (i.e. dfops->dop_low is true) to trim the agfl right back when we are really short of space. I'm also concerned that it doesn't take into account that freeing a block from the AGFL could cause a freespace tree split to occur, thereby emptying the AGFL whilst consuming the entire log reservation for tree modifications. This leaves nothing in the log reservation for re-growing the AGFL to the minimum required, which we /must do/ before returning and could cause more splits/joins to occur. IMO, there's a lot more to be concerned about here than just trying to work around the specific symptom observed in the given test case. This code is, unfortunately, really tricky and intricate and history tells us that the risk of unexpected regressions is extremely high, especially around ENOSPC related issues. Of all the patches in this series, this is the most dangerous and "iffy" of them and the one we should be most concerned and conservative about.... I'm not sure what the solution to this problem is, but I'm extremely hesitant to make changes like this without an awful lot more analysis of its impact. Cheers, Dave.
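For readers skimming the diff above, the guard Brian's patch adds can be exercised with a tiny standalone model. Everything below (MODEL_AGFL_SIZE, struct agfl_model, the function names) is hypothetical scaffolding for illustration; the real policy lives in xfs_alloc_fix_freelist() and compares pagf_flcount against XFS_AGFL_SIZE(mp).

```c
#include <assert.h>

/*
 * Toy model of the AGFL shrink policy under discussion. The struct
 * and names are hypothetical stand-ins for state managed by
 * xfs_alloc_fix_freelist(); this sketches the policy only, not the
 * kernel implementation.
 */
#define MODEL_AGFL_SIZE	128	/* assumed slot count for the model */

struct agfl_model {
	int flcount;	/* blocks currently on the free list */
	int need;	/* dynamic minimum (xfs_alloc_min_freelist) */
};

/* Old behavior: trim every surplus block in a single transaction. */
static int shrink_all(struct agfl_model *ag)
{
	int freed = 0;

	while (ag->flcount > ag->need) {
		ag->flcount--;		/* one extent free, logged in this tx */
		freed++;
	}
	return freed;
}

/*
 * Patched behavior: stop freeing once the AGFL is no more than half
 * full, leaving any remaining surplus for later transactions.
 */
static int shrink_capped(struct agfl_model *ag)
{
	int freed = 0;

	while (ag->flcount > ag->need) {
		ag->flcount--;
		freed++;
		if (ag->flcount <= MODEL_AGFL_SIZE / 2)
			break;
	}
	return freed;
}
```

In the model, a list sitting just above the dynamic minimum now gives up one block per transaction instead of draining to the minimum in one go, which is the amortization the commit message describes.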
On Tue, Nov 28, 2017 at 10:07:34AM +1100, Dave Chinner wrote: > On Mon, Nov 27, 2017 at 03:24:33PM -0500, Brian Foster wrote: > > We've had rare reports of transaction overruns in > > xfs_inactive_ifree() for quite some time. Analysis of a reproducing > > metadump has shown the problem is essentially caused by performing > > too many agfl block frees in a single transaction. > > > > For example, an inode chunk is freed and the associated agfl fixup > > algorithm discovers it needs to free a single agfl block before the > > chunk free occurs. This single block ends up causing a space btree > > join and adds one or more blocks back onto the agfl. This causes > > xfs_alloc_fix_freelist() to free up to 3 blocks just to rectify a > > single block discrepency. > > > > The transaction overrun occurs under several other unfortunate > > circumstances: > > > > - Each agfl block free is left+right contiguous. This requires 2 > > record deletions and 1 insertion for the cntbt and thus requires > > more log reservation than normal. > > - The associated transaction is the first in the CIL ctx and thus > > the ctx header reservation is consumed. > > - The transaction reservation is larger than a log buffer and thus > > extra split header reservation is consumed. > > > > As a result of the agfl and free space state of the filesystem, the > > agfl fixup code has dirtied more cntbt buffer space than allowed by > > the portion of the reservation allotted for block allocation. This > > is all before the real allocation even starts! > > > > Note that the log related conditions above are correctly covered by > > the existing transaction reservation. The above demonstrates that > > the reservation allotted for the context/split headers may help > > suppress overruns in the more common case where that reservation > > goes unused for its intended purpose. > > > > To address this problem, update xfs_alloc_fix_freelist() to amortize > > agfl block frees over multiple transactions. 
Free one block per > > transaction so long as the agfl is less than half free. The agfl > > minimum allocation requirement is dynamic, but is based on the > > geometry of the associated btrees (i.e., level count) and therefore > > should be easily rectified over multiple allocation transactions. > > Further, there is no real harm in leaving extraneous blocks on the > > agfl so long as there are enough free slots available for btree > > blocks freed as a result of the upcoming allocation. > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > fs/xfs/libxfs/xfs_alloc.c | 15 ++++++++++----- > > 1 file changed, 10 insertions(+), 5 deletions(-) > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > index 0da80019a917..d8d58e35da00 100644 > > --- a/fs/xfs/libxfs/xfs_alloc.c > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > @@ -2117,11 +2117,6 @@ xfs_alloc_fix_freelist( > > * appropriately based on the recursion count and dirty state of the > > * buffer. > > * > > - * XXX (dgc): When we have lots of free space, does this buy us > > - * anything other than extra overhead when we need to put more blocks > > - * back on the free list? Maybe we should only do this when space is > > - * getting low or the AGFL is more than half full? > > - * > > * The NOSHRINK flag prevents the AGFL from being shrunk if it's too > > * big; the NORMAP flag prevents AGFL expand/shrink operations from > > * updating the rmapbt. Both flags are used in xfs_repair while we're > > @@ -2151,6 +2146,16 @@ xfs_alloc_fix_freelist( > > goto out_agbp_relse; > > } > > xfs_trans_binval(tp, bp); > > + > > + /* > > + * Freeing all extra agfl blocks adds too much log reservation > > + * overhead to a single transaction, particularly considering > > + * that freeing a block can cause a btree join and put one right > > + * back on the agfl. Try to free one block per tx so long as > > + * we've left enough free slots for the upcoming modifications. 
> > + */ > > + if (pag->pagf_flcount <= (XFS_AGFL_SIZE(mp) >> 1)) > > + break; > > } > > In /theory/, this /should/ work. However, as the comment you removed > implies, there are likely to be issues with this as we get near > ENOSPC. We know that if we don't trim the AGFL right down to the > minimum requested as we approach ENOSPC we can get premature ENOSPC > events being reported that lead to filesystem shutdowns. (e.g. need > the very last free block for a BMBT block to complete a data extent > allocation). Hence I'd suggest that this needs to be aware of the > low space allocation algorithm (i.e. dfops->dop_low is true) to trim > the agfl right back when we are really short of space > Hmm, shouldn't the worst case bmbt requirements be satisfied by the block reservation of the transaction that maps the extent (or stored as indirect reservation if delalloc)? I'm less concerned with premature ENOSPC from contexts where it's not fatal.. IIRC, I think we've already incorporated changes (Christoph's minfree stuff rings a bell) into the allocator that explicitly prefer premature ENOSPC over potentially fatal conditions, but I could be mistaken. We're also only talking about 256k or so for half of the AGFL, less than that if we assume that some of those blocks are actually required by the agfl and not slated to be lazily freed. We could also think about explicitly fixing up agfl surplus from other contexts (background, write -ENOSPC handling, etc.) if it became that much of a problem. I do think it's semi-reasonable from a conservative standpoint to further restrict this behavior to !dop_low conditions, but I'm a little concerned about creating an "open the floodgates" type situation when the agfl is allowed to carry extra blocks and then all of a sudden the free space state changes and one transaction is allowed to free a bunch of blocks. Thoughts? 
I suppose we could incorporate logic that frees until/unless a join occurs (i.e., the last free did not drop flcount), the latter being an indication that we've probably logged as much as we should for agfl fixups in the transaction. But that's also something we could just do unconditionally as opposed to only under dop_low conditions. That might be less aggressive of a change from current behavior. > I'm also concerned that it doesn't take into account that freeing > a block from the AGFL could cause a freespace tree split to occur, > thereby emptying the AGFL whilst consuming the entire log > reservation for tree modifications. This leaves nothing in the log > reservation for re-growing the AGFL to the minimum required, which > we /must do/ before returning and could cause more splits/joins to > occur. > How is this different from the current code? This sounds to me like an unconditional side effect of the fact that freeing an agfl block can indirectly affect the agfl via the btree operations. IOW, freeing a single extra block could require consuming one or more and trigger the need for an allocation. I suspect the allocation could then very well cause a join on the other tree and put more than one block back onto the agfl. > IMO, there's a lot more to be concerned about here than just trying > to work around the specific symptom observed in the given test case. > This code is, unfortunately, really tricky and intricate and history > tells us that the risk of unexpected regressions is extremely high, > especially around ENOSPC related issues. Of all the patches in this > series, this is the most dangerous and "iffy" of them and the > one we should be most concerned and conservative about.... > Agreed. The impact is something that also has to be evaluated over a sequence of transactions along with the more obvious impact on a single transaction. 
Brian > I'm not sure what the solution to this problem is, but I'm extremely > hesitant to make changes like this without an awful lot more > analysis of it's impact. > > Cheers, > > Dave. > > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
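A sketch of the "free until/unless a join occurs" heuristic Brian floats above. Since this model has no real btrees, join events are injected through an array; in the kernel the signal would be pagf_flcount failing to drop across an xfs_free_extent() call. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: keep trimming surplus AGFL blocks within the
 * current transaction until one of the frees triggers a btree join
 * that puts a block straight back on the AGFL, observable as the
 * flcount failing to drop. join_on_free[i] says whether the i-th
 * free causes such a join in this scenario.
 */
struct agfl_state {
	int flcount;	/* blocks on the free list */
	int need;	/* dynamic minimum requirement */
};

static int shrink_until_join(struct agfl_state *ag,
			     const bool *join_on_free, int max_frees)
{
	int freed = 0;

	while (ag->flcount > ag->need && freed < max_frees) {
		int before = ag->flcount;

		ag->flcount--;			/* free one AGFL block */
		if (join_on_free[freed])
			ag->flcount++;		/* join refilled the AGFL */
		freed++;
		if (ag->flcount >= before)
			break;			/* no net progress: stop */
	}
	return freed;
}
```

The appeal of this shape is that the stop condition is driven by observed log traffic (a join means extra btree blocks were dirtied) rather than by a fixed watermark, though as the thread notes any such heuristic would still need analysis and testing.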
On Tue, Nov 28, 2017 at 08:57:48AM -0500, Brian Foster wrote: > On Tue, Nov 28, 2017 at 10:07:34AM +1100, Dave Chinner wrote: > > On Mon, Nov 27, 2017 at 03:24:33PM -0500, Brian Foster wrote: > > > We've had rare reports of transaction overruns in > > > xfs_inactive_ifree() for quite some time. Analysis of a reproducing > > > metadump has shown the problem is essentially caused by performing > > > too many agfl block frees in a single transaction. > > > > > > For example, an inode chunk is freed and the associated agfl fixup > > > algorithm discovers it needs to free a single agfl block before the > > > chunk free occurs. This single block ends up causing a space btree > > > join and adds one or more blocks back onto the agfl. This causes > > > xfs_alloc_fix_freelist() to free up to 3 blocks just to rectify a > > > single block discrepency. > > > > > > The transaction overrun occurs under several other unfortunate > > > circumstances: > > > > > > - Each agfl block free is left+right contiguous. This requires 2 > > > record deletions and 1 insertion for the cntbt and thus requires > > > more log reservation than normal. > > > - The associated transaction is the first in the CIL ctx and thus > > > the ctx header reservation is consumed. > > > - The transaction reservation is larger than a log buffer and thus > > > extra split header reservation is consumed. > > > > > > As a result of the agfl and free space state of the filesystem, the > > > agfl fixup code has dirtied more cntbt buffer space than allowed by > > > the portion of the reservation allotted for block allocation. This > > > is all before the real allocation even starts! > > > > > > Note that the log related conditions above are correctly covered by > > > the existing transaction reservation. The above demonstrates that > > > the reservation allotted for the context/split headers may help > > > suppress overruns in the more common case where that reservation > > > goes unused for its intended purpose. 
> > > > > > To address this problem, update xfs_alloc_fix_freelist() to amortize > > > agfl block frees over multiple transactions. Free one block per > > > transaction so long as the agfl is less than half free. The agfl > > > minimum allocation requirement is dynamic, but is based on the > > > geometry of the associated btrees (i.e., level count) and therefore > > > should be easily rectified over multiple allocation transactions. > > > Further, there is no real harm in leaving extraneous blocks on the > > > agfl so long as there are enough free slots available for btree > > > blocks freed as a result of the upcoming allocation. > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > --- > > > fs/xfs/libxfs/xfs_alloc.c | 15 ++++++++++----- > > > 1 file changed, 10 insertions(+), 5 deletions(-) > > > > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c > > > index 0da80019a917..d8d58e35da00 100644 > > > --- a/fs/xfs/libxfs/xfs_alloc.c > > > +++ b/fs/xfs/libxfs/xfs_alloc.c > > > @@ -2117,11 +2117,6 @@ xfs_alloc_fix_freelist( > > > * appropriately based on the recursion count and dirty state of the > > > * buffer. > > > * > > > - * XXX (dgc): When we have lots of free space, does this buy us > > > - * anything other than extra overhead when we need to put more blocks > > > - * back on the free list? Maybe we should only do this when space is > > > - * getting low or the AGFL is more than half full? > > > - * > > > * The NOSHRINK flag prevents the AGFL from being shrunk if it's too > > > * big; the NORMAP flag prevents AGFL expand/shrink operations from > > > * updating the rmapbt. 
Both flags are used in xfs_repair while we're > > > @@ -2151,6 +2146,16 @@ xfs_alloc_fix_freelist( > > > goto out_agbp_relse; > > > } > > > xfs_trans_binval(tp, bp); > > > + > > > + /* > > > + * Freeing all extra agfl blocks adds too much log reservation > > > + * overhead to a single transaction, particularly considering > > > + * that freeing a block can cause a btree join and put one right > > > + * back on the agfl. Try to free one block per tx so long as > > > + * we've left enough free slots for the upcoming modifications. > > > + */ > > > + if (pag->pagf_flcount <= (XFS_AGFL_SIZE(mp) >> 1)) > > > + break; > > > } > > > > In /theory/, this /should/ work. However, as the comment you removed > > implies, there are likely to be issues with this as we get near > > ENOSPC. We know that if we don't trim the AGFL right down to the > > minimum requested as we approach ENOSPC we can get premature ENOSPC > > events being reported that lead to filesystem shutdowns. (e.g. need > > the very last free block for a BMBT block to complete a data extent > > allocation). Hence I'd suggest that this needs to be aware of the > > low space allocation algorithm (i.e. dfops->dop_low is true) to trim > > the agfl right back when we are really short of space > > > > Hmm, shouldn't the worst case bmbt requirements be satisfied by the > block reservation of the transaction that maps the extent (or stored as > indirect reservation if delalloc)? That's checked against global free space when we do the initial reservation. It's not checked against AG free space until we make the allocation attempt. The problem here is that if we don't leave enough space for the BMBT block in the same AG as the data extent was allocated we can end up with an ENOSPC on the BMBT block allocation because, say, AGF locking order and all the higher AGFs are already ENOSPC. That's a fatal ENOSPC error as the transaction is already dirty and will shutdown the filesystem on transaction cancel. 
> I'm less concerned with premature > ENOSPC from contexts where it's not fatal.. The definition of "premature ENOSPC" is "unexpected ENOSPC that causes a filesystem shutdown" :/ > > IIRC, I think we've already > incorporated changes (Christoph's minfree stuff rings a bell) into the > allocator that explicitly prefer premature ENOSPC over potentially fatal > conditions, but I could be mistaken. Yes, but that doesn't take into account all the crazy AGFL manipulations that could occur as a result of trimming/extending the AGFL. We /assume/ that we don't need significant reservations to extend/trim the AGFL to what is required for the transaction. If we now have to free 50 blocks from the AGFL back to the freespace tree before we start the allocation/free operation, that's going to have some impact on the reservations we have. And think about 4k sector filesystems - there's close on 1000 entries in the AGFL - this means we could have 500 blocks floating on the AGFL that would otherwise be in the freespace tree. IOWs, this change could have significant impact on freespace fragmentation because blocks aren't getting returned to the freespace tree where they merge with other small freespace extents to reform larger extents. It's these sorts of issues (i.e. filesystem aging) that might not show up for months of production use that we need to think about. It may be that a larger AGFL helps here because it keeps a working set of freespace tree blocks on the AGFL rather than having to go back to the freespace trees all the time, but this is something we need to measure, analyse and characterise before changing a critical piece of the allocation architecture.... > We're also only talking about 256k > or so for half of the AGFL, less than that if we assume that some of > those blocks are actually required by the agfl and not slated to be > lazily freed. We could also think about explicitly fixing up agfl > surplus from other contexts (background, write -ENOSPC handling, etc.) 
> if it became that much of a problem. > > I do think it's semi-reasonable from a conservative standpoint to > further restrict this behavior to !dop_low conditions, but I'm a little > concerned about creating an "open the floodgates" type situation when > the agfl is allowed to carry extra blocks and then all of a sudden the > free space state changes and one transaction is allowed to free a bunch > of blocks. Thoughts? That's exactly the sort of problem we need to measure, analyse and characterise before making this sort of change :/ > I suppose we could incorporate logic that frees until/unless a join > occurs (i.e., the last free did not drop flcount), the latter being an > indication that we've probably logged as much as we should for agfl > fixups in the transaction. But that's also something we could just do > unconditionally as opposed to only under dop_low conditions. That might > be less aggressive of a change from current behavior. Heuristics will still need to be analysed and tested :/ The complexity of doing this is why I left a small comment rather than actually making the change.... > > I'm also concerned that it doesn't take into account that freeing > > a block from the AGFL could cause a freespace tree split to occur, > > thereby emptying the AGFL whilst consuming the entire log > > reservation for tree modifications. This leaves nothing in the log > > reservation for re-growing the AGFL to the minimum required, which > > we /must do/ before returning and could cause more splits/joins to > > occur. > > > > How is this different from the current code? It's not. I'm pointing out that while you're focussed on the problems with shortening the AGFL, the same problem exists with /extending the AGFL/. > This sounds to me like an > unconditional side effect of the fact that freeing an agfl block can > indirectly affect the agfl via the btree operations. 
IOW, freeing a > single extra block could require consuming one or more and trigger the > need for an allocation. I suspect the allocation could then very well > cause a join on the other tree and put more than one block back onto the > agfl. Yes, it could do that too. Remove a single block from an existing free extent, no change to the by-block btree. by-cnt now requires a record delete (full height join) followed by record insert elsewhere in the tree (full height split). So the attempt to add a block to the AGFL can actually shorten it if the by-cnt tree splits on insert. It can grow if the by-block or by-cnt tree joins on record removal. Either way, we've got the same problem of using the entire log reservation for AGFL modification when growing the AGFL as we do when trying to shrink the AGFL down. That's my point here - just hacking a limit into the shrink case doesn't address the problem at all - it just papers over one of the visible symptoms.... > > IMO, there's a lot more to be concerned about here than just trying > > to work around the specific symptom observed in the given test case. > > This code is, unfortunately, really tricky and intricate and history > > tells us that the risk of unexpected regressions is extremely high, > > especially around ENOSPC related issues. Of all the patches in this > > series, this is the most dangerous and "iffy" of them and the > > one we should be most concerned and conservative about.... > > > > Agreed. The impact is something that also has to be evaluated over a > sequence of transactions along with the more obvious impact on a single > transaction. The impact has to be measured over a far longer time frame than a few transactions. The fact it has impact on freespace reformation means it'll need accelerated aging tests done on it so we can be reasonably certain that it isn't going to bite us in extended production environments... Cheers, Dave.
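Dave's point that shortening a free extent is a record delete plus a record insert in the by-cnt tree (never an in-place update, since the record's key, the extent length, changes) can be seen with a flat sorted-array stand-in for the cntbt. This is a hypothetical model, not the kernel's btree code:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Flat-array stand-in for by-cnt (cntbt) records, kept sorted by
 * extent length as the real tree is keyed. Taking one block off the
 * front of an extent means a record delete followed by a record
 * insert at the new position; in the real btree, each of those can
 * ripple into a full-height join or split respectively.
 */
struct cnt_rec {
	unsigned int len;	/* extent length in blocks */
	unsigned int bno;	/* starting block of the extent */
};

/* Returns the number of record operations performed. */
static int shrink_extent(struct cnt_rec *recs, size_t nrecs, size_t idx)
{
	struct cnt_rec rec = recs[idx];
	size_t i, pos;

	/* record delete: close the gap left by the old record */
	for (i = idx; i + 1 < nrecs; i++)
		recs[i] = recs[i + 1];
	nrecs--;

	rec.len--;		/* one block moved to the AGFL */
	rec.bno++;

	/* record insert: find the slot that preserves length order */
	for (pos = 0; pos < nrecs && recs[pos].len < rec.len; pos++)
		;
	for (i = nrecs; i > pos; i--)
		recs[i] = recs[i - 1];
	recs[pos] = rec;

	return 2;		/* one delete plus one insert */
}
```

The constant return value is the point: even this "simple" AGFL refill touches the by-cnt tree twice, and either touch can consume the worst-case reservation if it lands on a node boundary.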
On Wed, Nov 29, 2017 at 09:09:19AM +1100, Dave Chinner wrote: > On Tue, Nov 28, 2017 at 08:57:48AM -0500, Brian Foster wrote: > > On Tue, Nov 28, 2017 at 10:07:34AM +1100, Dave Chinner wrote: > > > On Mon, Nov 27, 2017 at 03:24:33PM -0500, Brian Foster wrote: ... > > > > > > In /theory/, this /should/ work. However, as the comment you removed > > > implies, there are likely to be issues with this as we get near > > > ENOSPC. We know that if we don't trim the AGFL right down to the > > > minimum requested as we approach ENOSPC we can get premature ENOSPC > > > events being reported that lead to filesystem shutdowns. (e.g. need > > > the very last free block for a BMBT block to complete a data extent > > > allocation). Hence I'd suggest that this needs to be aware of the > > > low space allocation algorithm (i.e. dfops->dop_low is true) to trim > > > the agfl right back when we are really short of space > > > > > > > Hmm, shouldn't the worst case bmbt requirements be satisfied by the > > block reservation of the transaction that maps the extent (or stored as > > indirect reservation if delalloc)? > > That's checked against global free space when we do the initial > reservation. It's not checked against AG free space until we make > the allocation attempt. > > The problem here is that if we don't leave enough space for the BMBT > block in the same AG as the data extent was allocated we can end up > with an ENOSPC on the BMBT block allocation because, say, AGF > locking order and all the higher AGFs are already ENOSPC. That's a > fatal ENOSPC error as the transaction is already dirty and will > shutdown the filesystem on transaction cancel. > I thought Christoph already addressed that problem in commit 255c516278 ("xfs: fix bogus minleft manipulations"). If not, that sounds like something that would need to be fixed independent from whether the associated AGFL happens to hold surplus blocks. 
AFAICT that problem could occur whether blocks sit on the AGFL as a surplus, legitimately sit on the AGFL or are simply allocated for some other purpose. > > I'm less concerned with premature > > ENOSPC from contexts where it's not fatal.. > > The definition of "premature ENOSPC" is "unexpected ENOSPC that > causes a filesystem shutdown" :/ > That's a bit pedantic. We've had plenty of discussions on the list using "premature ENOSPC" to describe errors that aren't necessarily fatal for the fs. Eofblocks trimming, sparse inodes and the broader work the minleft fix referenced above was part of are just a few obvious examples of context. > > > > IIRC, I think we've already > > incorporated changes (Christoph's minfree stuff rings a bell) into the > > allocator that explicitly prefer premature ENOSPC over potentially fatal > > conditions, but I could be mistaken. > > Yes, but that doesn't take into account all the crazy AGFL > manipulations that could occur as a result of trimming/extending the > AGFL. We /assume/ that we don't need signification reservations to > extend/trim the AGFL to what is required for the transaction. If we > now how to free 50 blocks from the AGFL back to the freespace tree > before we start the allocation/free operation, that's going to > have some impact on the reservations we have. > How does this patch exacerbate that problem? The current code performs N agfl frees per fixup* to drop below a dynamic watermark (xfs_alloc_min_freelist()). The patch changes that behavior to perform 1 agfl free per fixup unless we're over a fixed watermark (1/2 agfl size). ISTM that how far we can exceed that watermark in a given fixup (to be rectified by the next) is on a similar order in either case (and if anything, it seems like we factor out the case of recursively populating the AGFL and shrinking the watermark at the same time). * By fixup, I'm referring to one pass through xfs_alloc_fix_freelist(). 
> And think about 4k sector filesystems - there's close on 1000 > entries in the AGFL - this means we could have 500 blocks floating > on the AGFL that would otherwise be in the freespace tree. IOWs, > this change could have significant impact on freespace > fragmentation because blocks aren't getting returned to the > freespace tree where they merge with other small freespace extents > to reform larger extents. > Good point. I'm a little curious how we'd really end up with that many blocks on the agfl. The max requirement for the bnobt and cntbt is 4 each for a 1TB AG. The rmapbt adds another optional dependency, but IIRC that only adds another 4 or 5 more blocks. I suppose this might require some kind of sequence where consecutive agfl fixups end up freeing a block that results in one or more full tree joins, and thus we populate the agfl much faster than we shrink it (since we'd remove 1 block at a time) for some period of time. With regard to the impact on free space fragmentation.. I suppose we could mitigate that by setting the surplus limit to some multiple of the max possible agfl requirement instead of 1/2 the physical size of the agfl. We could also consider doing things like fixing up AGFL surpluses in a more safe contexts (i.e., in response to -ENOSPC on write where we also trim eofblocks, free up indlen reservations, etc.) or allowing a surplus block to be used in that BMBT block allocation failure scenario as a last resort to fs shutdown (and if there isn't a surplus block, then we'd be shutting down anyways). Note that I'm not currently convinced either of these are necessary, I'm just thinking out loud about how to deal with some of these potential hazards if they prove legitimate. > It's these sorts of issues (i.e. filesystem aging) that might not > show up for months of production use that we need to think about. 
It > may be that a larger AGFL helps here because it keeps a working > set of freespace tree blocks on the AGFL rather than having to > go back to the freespace trees all the time, but this is something > we need to measure, analyse and characterise before changing a > critical piece of the allocation architecture.... > I agree. I'm going to drop this patch from v2 for now because I don't want to hold up the other more straightforward transaction reservation fixes while we work out how best to fix this problem. > > We're also only talking about 256k > > or so for half of the AGFL, less than that if we assume that some of > > those blocks are actually required by the agfl and not slated to be > > lazily freed. We could also think about explicitly fixing up agfl > > surplus from other contexts (background, write -ENOSPC handling, etc.) > > if it became that much of a problem. > > > > I do think it's semi-reasonable from a conservative standpoint to > > further restrict this behavior to !dop_low conditions, but I'm a little > > concerned about creating an "open the floodgates" type situation when > > the agfl is allowed to carry extra blocks and then all of a sudden the > > free space state changes and one transaction is allowed to free a bunch > > of blocks. Thoughts? > > That's exactly the sort of problem we need to measure, analyse > and characterise before making this sort of change :/ > > > I suppose we could incorporate logic that frees until/unless a join > > occurs (i.e., the last free did not drop flcount), the latter being an > > indication that we've probably logged as much as we should for agfl > > fixups in the transaction. But that's also something we could just do > > unconditionally as opposed to only under dop_low conditions. That might > > be less aggressive of a change from current behavior. 
> > Heuristics will still need to be analysed and tested :/ > > The complexity of doing this is why I left a small comment rather > than actually making the change.... > > > > I'm also concerned that it doesn't take into account that freeing > > > a block from the AGFL could cause a freespace tree split to occur, > > > thereby emptying the AGFL whilst consuming the entire log > > > reservation for tree modifications. This leaves nothing in the log > > > reservation for re-growing the AGFL to the minimum required, which > > > we /must do/ before returning and could cause more splits/joins to > > > occur. > > > > > > > How is this different from the current code? > > It's not. I'm pointing out that while you're focussed on the > problems with shortening the AGFL, the same problem exists with > /extending the AGFL/. > > > This sounds to me like an > > unconditional side effect of the fact that freeing an agfl block can > > indirectly affect the agfl via the btree operations. IOW, freeing a > > single extra block could require consuming one or more and trigger the > > need for an allocation. I suspect the allocation could then very well > > cause a join on the other tree and put more than one block back onto the > > agfl. > > Yes, it could do that too. Remove a single block from an existing > free extent, no change to the by-block btree. by-cnt now requires a > record delete (full height join) followed by record insert elsewhere > in the tree (full height split). So the attempt to add a block to > the AGFL can actually shorten it if the by-cnt tree splits on > insert. It can grow if the by-block or by-cnt tree joins on record > removal. > > Either way, we've got the same problem of using the entire log > reservation for AGFL modification when growing the AGFL as we do > when trying to shrink the AGFL down. > > That's my point here - just hacking a limit into the shrink case > doesn't address the problem at all - it just papers over one of the > visible symptoms.... 
> Yes, it's clear that the current code allows for all sorts of theoretical avenues to transaction overrun. Hence my previous idea to roll the transaction once an AGFL fixup triggers a join or split. Even that may not be sufficient in certain scenarios. Moving on from that, this patch is a variant of your suggestion to allow leaving surplus blocks on the agfl up to a certain limit. It is intentionally a more isolated fix for the specific issue of performing too many independent allocations (frees) per transaction in this context. One approach doesn't have to preclude the other. I'm not aware of any pattern of overrun problems with this code over the however many years it has been in place, other than this one, however. Given that, I think it's perfectly reasonable to consider a shorter term solution so long as we're confident it doesn't introduce other problems. > > > IMO, there's a lot more to be concerned about here than just trying > > > to work around the specific symptom observed in the given test case. > > > This code is, unfortunately, really tricky and intricate and history > > > tells us that the risk of unexpected regressions is extremely high, > > > especially around ENOSPC related issues. Of all the patches in this > > > series, this is the most dangerous and "iffy" of them and the > > > one we should be most concerned and conservative about.... > > > > > > > Agreed. The impact is something that also has to be evaluated over a > > sequence of transactions along with the more obvious impact on a single > > transaction. > > The impact has to be measured over a far longer time frame than a > few transactions. The fact it has impact on freespace reformation > means it'll need accelerated aging tests done on it so we can be > reasonably certain that it isn't going to bite us in extended > production environments... > That's exactly what I mean by a sequence. E.g., the effect on the agfl over time. 
Clearly, the longer the sequence, the more robust the results. I'm not sure where the idea that a few transactions would provide anything useful comes from. This probably needs to be evaluated over many cycles of fully depopulating and repopulating the space btrees in different ways.

Brian

> Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Nov 29, 2017 at 01:24:53PM -0500, Brian Foster wrote: > On Wed, Nov 29, 2017 at 09:09:19AM +1100, Dave Chinner wrote: > > On Tue, Nov 28, 2017 at 08:57:48AM -0500, Brian Foster wrote: > > > On Tue, Nov 28, 2017 at 10:07:34AM +1100, Dave Chinner wrote: > > > > On Mon, Nov 27, 2017 at 03:24:33PM -0500, Brian Foster wrote: ... > > > This sounds to me like an > > > unconditional side effect of the fact that freeing an agfl block can > > > indirectly affect the agfl via the btree operations. IOW, freeing a > > > single extra block could require consuming one or more and trigger the > > > need for an allocation. I suspect the allocation could then very well > > > cause a join on the other tree and put more than one block back onto the > > > agfl. > > > > Yes, it could do that too. Remove a single block from an existing > > free extent, no change to the by-block btree. by-cnt now requires a > > record delete (full height join) followed by record insert elsewhere > > in the tree (full height split). So the attempt to add a block to > > the AGFL can actually shorten it if the by-cnt tree splits on > > insert. It can grow if the by-block or by-cnt tree joins on record > > removal. > > > > Either way, we've got the same problem of using the entire log > > reservation for AGFL modification when growing the AGFL as we do > > when trying to shrink the AGFL down. > > > > That's my point here - just hacking a limit into the shrink case > > doesn't address the problem at all - it just papers over one of the > > visible symptoms.... > > > > Yes, it's clear that the current code allows for all sorts of > theoretical avenues to transaction overrun. Hence my previous idea to > roll the transaction once an AGFL fixup triggers a join or split. Even > that may not be sufficient in certain scenarios. > Random thought from this afternoon... What do you think about unconditionally removing surplus agfl blocks as we do today, but defer them rather than free them immediately? 
We'd free up the agfl slots as needed so the allocation can proceed, but we'd eliminate the behavior where agfl frees recursively affect the agfl. The broader idea is that in the event where 2+ agfl frees are required, they'd now execute in a context where we can enforce deterministic log consumption per tx (by also implementing the 2 frees per EFD idea, for example) until the agfl is rectified.

I'd have to think a little more about whether the idea is sane... It looks like we'd have to plumb dfops in through xfs_alloc_arg for starters. We could do the defer conditionally based on whether the caller passes dfops to facilitate incremental conversions. We also may be able to consider optimizations like putting deferred agfl blocks right back onto the agfl if there's a deficit by the time we get around to freeing them (rather than doing an agfl alloc only to free up deferred agfl blocks), but that's probably getting too far ahead for now.

This also doesn't help with extending the agfl, but I think that path is already more deterministic since we attempt to fill the deficit in as few allocs as possible. Thoughts?

Brian

> Moving on from that, this patch is a variant of your suggestion to allow > leaving surplus blocks on the agfl up to a certain limit. It is > intentionally a more isolated fix for the specific issue of performing > too many independent allocations (frees) per transaction in this > context. > > One approach doesn't have to preclude the other. I'm not aware of any > pattern of overrun problems with this code over the however many years > it has been in place, other than this one, however. Given that, I think > it's perfectly reasonable to consider a shorter term solution so long as > we're confident it doesn't introduce other problems. > > > > IMO, there's a lot more to be concerned about here than just trying > > > to work around the specific symptom observed in the given test case.
> > > > This code is, unfortunately, really tricky and intricate and history > > > > tells us that the risk of unexpected regressions is extremely high, > > > > especially around ENOSPC related issues. Of all the patches in this > > > > series, this is the most dangerous and "iffy" of them and the > > > > one we should be most concerned and conservative about.... > > > > > > > > > > Agreed. The impact is something that also has to be evaluated over a > > > sequence of transactions along with the more obvious impact on a single > > > transaction. > > > > The impact has to be measured over a far longer time frame than a > > few transactions. The fact it has impact on freespace reformation > > means it'll need accelerated aging tests done on it so we can be > > reasonably certain that it isn't going to bite us in extended > > production environments... > > > > That's exactly what I mean by a sequence. E.g., the effect on the agfl > over time. Clearly, the longer the sequence, the more robust the > results. I'm not sure where the idea that a few transactions would > provide anything useful comes from. This probably needs to be evaluated > over many cycles of fully depopulating and repopulating the space btrees > in different ways. > > Brian > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com
On Wed, Nov 29, 2017 at 03:36:06PM -0500, Brian Foster wrote: > On Wed, Nov 29, 2017 at 01:24:53PM -0500, Brian Foster wrote: > > On Wed, Nov 29, 2017 at 09:09:19AM +1100, Dave Chinner wrote: > > > On Tue, Nov 28, 2017 at 08:57:48AM -0500, Brian Foster wrote: > > > > On Tue, Nov 28, 2017 at 10:07:34AM +1100, Dave Chinner wrote: > > > > > On Mon, Nov 27, 2017 at 03:24:33PM -0500, Brian Foster wrote: > ... > > > > This sounds to me like an > > > > unconditional side effect of the fact that freeing an agfl block can > > > > indirectly affect the agfl via the btree operations. IOW, freeing a > > > > single extra block could require consuming one or more and trigger the > > > > need for an allocation. I suspect the allocation could then very well > > > > cause a join on the other tree and put more than one block back onto the > > > > agfl. > > > > > > Yes, it could do that too. Remove a single block from an existing > > > free extent, no change to the by-block btree. by-cnt now requires a > > > record delete (full height join) followed by record insert elsewhere > > > in the tree (full height split). So the attempt to add a block to > > > the AGFL can actually shorten it if the by-cnt tree splits on > > > insert. It can grow if the by-block or by-cnt tree joins on record > > > removal. > > > > > > Either way, we've got the same problem of using the entire log > > > reservation for AGFL modification when growing the AGFL as we do > > > when trying to shrink the AGFL down. > > > > > > That's my point here - just hacking a limit into the shrink case > > > doesn't address the problem at all - it just papers over one of the > > > visible symptoms.... > > > > > > > Yes, it's clear that the current code allows for all sorts of > > theoretical avenues to transaction overrun. Hence my previous idea to > > roll the transaction once an AGFL fixup triggers a join or split. Even > > that may not be sufficient in certain scenarios. > > > > Random thought from this afternoon... 
> > What do you think about unconditionally removing surplus agfl blocks as > we do today, but defer them rather than free them immediately? We'd free > up the agfl slots as needed so the allocation can proceed, but we'd > eliminate the behavior where agfl frees recursively affect the agfl. The > broader idea is that in the event where 2+ agfl frees are required, > they'd now execute in a context where we can enforce deterministic log > consumption per tx (by also implementing the 2 frees per EFD idea, for > example) until the agfl is rectified. > > I'd have to think a little more about whether the idea is sane... It > looks like we'd have to plumb dfops in through xfs_alloc_arg for > starters. We could do the defer conditionally based on whether the > caller passes dfops to facilitate incremental conversions. We also may > be able to consider optimizations like putting deferred agfl blocks > right back onto the agfl if there's a deficit by the time we get around > to freeing them (rather than doing an agfl alloc only to free up > deferred agfl blocks), but that's probably getting too far ahead for > now. > > This also doesn't help with extending the agfl, but I think that path is > already more deterministic since we attempt to fill the deficit in as > few allocs as possible. Thoughts?

I've hacked on the above a bit... enough to at least determine that deferring AGFL frees does avoid the reservation overrun (without any of the recent transaction reservation fixes). To avoid the overrun, I had to defer AGFL frees from the inobt (xfs_inobt_free_block()) path and from the deferred ops xfs_trans_extent_free() path. To this point, I don't see any fundamental reason why we couldn't defer AGFL frees from any other path as well so long as we have a dfops structure available in the associated context. We'd have to deal with the following caveats due to how AGFL blocks differ from typical blocks freed via xfs_bmap_add_free(): 1.)
AGFL blocks are accounted against different allocation counters. 2.) AGFL blocks are not marked busy on free (in fact, they are possibly already on the busy list by the time they are freed).

I've currently hacked around these problems using the OWN_AG owner info to handle AGFL blocks appropriately down in xfs_trans_free_extent(). What I think may provide a cleaner implementation is to define a new deferred ops type specific to AGFL frees. This would be a subset of XFS_DEFER_OPS_TYPE_FREE and share nearly all of the implementation outside of ->finish_item(). The latter would be replaced with a callout that handles AGFL blocks as noted. Thoughts?

With regard to AGFL allocation, I've not seen any overrun issues associated with that path. A thought that comes to mind for dealing with a problem in that area is to further genericize the above to do some kind of post-alloc fixup when we know an agfl block was consumed. For example, tag the perag as deficient for another context (thread/wq) to rectify, or defer a more generic AGFL fixup item from agfl block consumption. We'd still have to implement some kind of serialization or do the alloc directly in the worst case, but the idea is that hopefully that would be rare and in most cases all AGFL fixups would now occur with a full tx reservation available for the fixup (agfl fixup-ahead, if you will). That would probably require some experimentation to determine how effective it would be without resorting to nasty locking tricks and whatnot.

Note that I don't think anything as involved as the latter is currently necessary. As such, it's just unsubstantiated handwaving around how we could potentially evolve from deferred agfl frees to a more generic solution that covered all AGFL fixups.

Brian

> Brian > > > Moving on from that, this patch is a variant of your suggestion to allow > > leaving surplus blocks on the agfl up to a certain limit.
It is > > intentionally a more isolated fix for the specific issue of performing > > too many independent allocations (frees) per transaction in this > > context. > > > > One approach doesn't have to preclude the other. I'm not aware of any > > pattern of overrun problems with this code over the however many years > > it has been in place, other than this one, however. Given that, I think > > it's perfectly reasonable to consider a shorter term solution so long as > > we're confident it doesn't introduce other problems. > > > > > > > IMO, there's a lot more to be concerned about here than just trying > > > > > to work around the specific symptom observed in the given test case. > > > > > This code is, unfortunately, really tricky and intricate and history > > > > > tells us that the risk of unexpected regressions is extremely high, > > > > > especially around ENOSPC related issues. Of all the patches in this > > > > > series, this is the most dangerous and "iffy" of them and the > > > > > one we should be most concerned and conservative about.... > > > > > > > > > > > > > Agreed. The impact is something that also has to be evaluated over a > > > > sequence of transactions along with the more obvious impact on a single > > > > transaction. > > > > > > The impact has to be measured over a far longer time frame than a > > > few transactions. The fact it has impact on freespace reformation > > > means it'll need accelerated aging tests done on it so we can be > > > reasonably certain that it isn't going to bite us in extended > > > production environments... > > > > > > > That's exactly what I mean by a sequence. E.g., the effect on the agfl > > over time. Clearly, the longer the sequence, the more robust the > > results. I'm not sure where the idea that a few transactions would > > provide anything useful comes from. This probably needs to be evaluated > > over many cycles of fully depopulating and repopulating the space btrees > > in different ways. 
> > > > Brian > > > > > Cheers, > > > > > > Dave. > > > -- > > > Dave Chinner > > > david@fromorbit.com
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 0da80019a917..d8d58e35da00 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2117,11 +2117,6 @@ xfs_alloc_fix_freelist(
 	 * appropriately based on the recursion count and dirty state of the
 	 * buffer.
 	 *
-	 * XXX (dgc): When we have lots of free space, does this buy us
-	 * anything other than extra overhead when we need to put more blocks
-	 * back on the free list? Maybe we should only do this when space is
-	 * getting low or the AGFL is more than half full?
-	 *
 	 * The NOSHRINK flag prevents the AGFL from being shrunk if it's too
 	 * big; the NORMAP flag prevents AGFL expand/shrink operations from
 	 * updating the rmapbt. Both flags are used in xfs_repair while we're
@@ -2151,6 +2146,16 @@ xfs_alloc_fix_freelist(
 			goto out_agbp_relse;
 		}
 		xfs_trans_binval(tp, bp);
+
+		/*
+		 * Freeing all extra agfl blocks adds too much log reservation
+		 * overhead to a single transaction, particularly considering
+		 * that freeing a block can cause a btree join and put one right
+		 * back on the agfl. Try to free one block per tx so long as
+		 * we've left enough free slots for the upcoming modifications.
+		 */
+		if (pag->pagf_flcount <= (XFS_AGFL_SIZE(mp) >> 1))
+			break;
 	}
 
 	targs.tp = tp;
We've had rare reports of transaction overruns in xfs_inactive_ifree() for quite some time. Analysis of a reproducing metadump has shown the problem is essentially caused by performing too many agfl block frees in a single transaction.

For example, an inode chunk is freed and the associated agfl fixup algorithm discovers it needs to free a single agfl block before the chunk free occurs. This single block ends up causing a space btree join and adds one or more blocks back onto the agfl. This causes xfs_alloc_fix_freelist() to free up to 3 blocks just to rectify a single block discrepancy.

The transaction overrun occurs under several other unfortunate circumstances:

- Each agfl block free is left+right contiguous. This requires 2 record deletions and 1 insertion for the cntbt and thus requires more log reservation than normal.
- The associated transaction is the first in the CIL ctx and thus the ctx header reservation is consumed.
- The transaction reservation is larger than a log buffer and thus extra split header reservation is consumed.

As a result of the agfl and free space state of the filesystem, the agfl fixup code has dirtied more cntbt buffer space than allowed by the portion of the reservation allotted for block allocation. This is all before the real allocation even starts!

Note that the log related conditions above are correctly covered by the existing transaction reservation. The above demonstrates that the reservation allotted for the context/split headers may help suppress overruns in the more common case where that reservation goes unused for its intended purpose.

To address this problem, update xfs_alloc_fix_freelist() to amortize agfl block frees over multiple transactions. Free one block per transaction so long as the agfl is less than half free. The agfl minimum allocation requirement is dynamic, but is based on the geometry of the associated btrees (i.e., level count) and therefore should be easily rectified over multiple allocation transactions.
Further, there is no real harm in leaving extraneous blocks on the agfl so long as there are enough free slots available for btree blocks freed as a result of the upcoming allocation. Signed-off-by: Brian Foster <bfoster@redhat.com> --- fs/xfs/libxfs/xfs_alloc.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-)