Message ID | 20210222153442.897089-1-bfoster@redhat.com (mailing list archive) |
---|---|
State | Accepted |
Headers | show |
Series | xfs: don't reuse busy extents on extent trim | expand |
On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote: > Freed extents are marked busy from the point the freeing transaction > commits until the associated CIL context is checkpointed to the log. > This prevents reuse and overwrite of recently freed blocks before > the changes are committed to disk, which can lead to corruption > after a crash. The exception to this rule is that metadata > allocation is allowed to reuse busy extents because metadata changes > are also logged. > > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS > has allowed modification or complete invalidation of outstanding > busy extents for metadata allocations. This implementation assumes > that use of the associated extent is imminent, which is not always > the case. For example, the trimmed extent might not satisfy the > minimum length of the allocation request, or the allocation > algorithm might be involved in a search for the optimal result based > on locality. > > generic/019 reproduces a corruption caused by this scenario. First, > a metadata block (usually a bmbt or symlink block) is freed from an > inode. A subsequent bmbt split on an unrelated inode attempts a near > mode allocation request that invalidates the busy block during the > search, but does not ultimately allocate it. Due to the busy state > invalidation, the block is no longer considered busy to subsequent > allocation. A direct I/O write request immediately allocates the > block and writes to it. I really hope there's a fstest case coming for this... :) > Finally, the filesystem crashes while in a > state where the initial metadata block free had not committed to the > on-disk log. After recovery, the original metadata block is in its > original location as expected, but has been corrupted by the > aforementioned dio. Wheee! Looking at xfs_alloc_ag_vextent_exact, I guess the allocator will go find a freespace record, call xfs_extent_busy_trim (which could erase the busy extent entry), decide that it's not interested after all, and bail out without restoring the busy entry. Similarly, xfs_alloc_cur_check calls _busy_trim (same side effects) as we wander around the free space btrees looking for a good chunk of space... and doesn't restore the busy record if it decides to consider a different extent. So I guess this "speculatively remove busy records and forget to restore them" behavior opens the door to the write allocating blocks that aren't yet free and nonbusy, right? And the solution presented here is to avoid letting go of the busy record for the bmbt allocation, and if the btree split caller decides it really /must/ have that block for the bmbt it can force the log and try again, just like we do for a file data allocation? Another solution could have been to restore the record if we decide not to go ahead with the allocation, but as we haven't yet committed to using the space, there's no sense in thrashing the busy records? Assuming I got all that right, Reviewed-by: Darrick J. Wong <djwong@kernel.org> --D > This demonstrates that it is fundamentally unsafe to modify busy > extent state for extents that are not guaranteed to be allocated. > This applies to pretty much all of the code paths that currently > trim busy extents for one reason or another. Therefore to address > this problem, drop the reuse mechanism from the busy extent trim > path. This code already knows how to return partial non-busy ranges > of the targeted free extent and higher level code tracks the busy > state of the allocation attempt. If a block allocation fails where > one or more candidate extents is busy, we force the log and retry > the allocation. > > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/xfs/xfs_extent_busy.c | 14 -------------- > 1 file changed, 14 deletions(-) > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c > index 3991e59cfd18..ef17c1f6db32 100644 > --- a/fs/xfs/xfs_extent_busy.c > +++ b/fs/xfs/xfs_extent_busy.c > @@ -344,7 +344,6 @@ xfs_extent_busy_trim( > ASSERT(*len > 0); > > spin_lock(&args->pag->pagb_lock); > -restart: > fbno = *bno; > flen = *len; > rbp = args->pag->pagb_tree.rb_node; > @@ -363,19 +362,6 @@ xfs_extent_busy_trim( > continue; > } > > - /* > - * If this is a metadata allocation, try to reuse the busy > - * extent instead of trimming the allocation. > - */ > - if (!(args->datatype & XFS_ALLOC_USERDATA) && > - !(busyp->flags & XFS_EXTENT_BUSY_DISCARDED)) { > - if (!xfs_extent_busy_update_extent(args->mp, args->pag, > - busyp, fbno, flen, > - false)) > - goto restart; > - continue; > - } > - > if (bbno <= fbno) { > /* start overlap */ > > -- > 2.26.2 >
On 22 Feb 2021 at 21:04, Brian Foster wrote: > Freed extents are marked busy from the point the freeing transaction > commits until the associated CIL context is checkpointed to the log. > This prevents reuse and overwrite of recently freed blocks before > the changes are committed to disk, which can lead to corruption > after a crash. The exception to this rule is that metadata > allocation is allowed to reuse busy extents because metadata changes > are also logged. > > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS > has allowed modification or complete invalidation of outstanding > busy extents for metadata allocations. This implementation assumes > that use of the associated extent is imminent, which is not always > the case. For example, the trimmed extent might not satisfy the > minimum length of the allocation request, or the allocation > algorithm might be involved in a search for the optimal result based > on locality. > > generic/019 reproduces a corruption caused by this scenario. First, > a metadata block (usually a bmbt or symlink block) is freed from an > inode. A subsequent bmbt split on an unrelated inode attempts a near > mode allocation request that invalidates the busy block during the > search, but does not ultimately allocate it. Due to the busy state > invalidation, the block is no longer considered busy to subsequent > allocation. A direct I/O write request immediately allocates the > block and writes to it. Finally, the filesystem crashes while in a > state where the initial metadata block free had not committed to the > on-disk log. After recovery, the original metadata block is in its > original location as expected, but has been corrupted by the > aforementioned dio. > > This demonstrates that it is fundamentally unsafe to modify busy > extent state for extents that are not guaranteed to be allocated. > This applies to pretty much all of the code paths that currently > trim busy extents for one reason or another. Therefore to address > this problem, drop the reuse mechanism from the busy extent trim > path. This code already knows how to return partial non-busy ranges > of the targeted free extent and higher level code tracks the busy > state of the allocation attempt. If a block allocation fails where > one or more candidate extents is busy, we force the log and retry > the allocation. > The changes look good to me. Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> > Signed-off-by: Brian Foster <bfoster@redhat.com> > --- > fs/xfs/xfs_extent_busy.c | 14 -------------- > 1 file changed, 14 deletions(-) > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c > index 3991e59cfd18..ef17c1f6db32 100644 > --- a/fs/xfs/xfs_extent_busy.c > +++ b/fs/xfs/xfs_extent_busy.c > @@ -344,7 +344,6 @@ xfs_extent_busy_trim( > ASSERT(*len > 0); > > spin_lock(&args->pag->pagb_lock); > -restart: > fbno = *bno; > flen = *len; > rbp = args->pag->pagb_tree.rb_node; > @@ -363,19 +362,6 @@ xfs_extent_busy_trim( > continue; > } > > - /* > - * If this is a metadata allocation, try to reuse the busy > - * extent instead of trimming the allocation. > - */ > - if (!(args->datatype & XFS_ALLOC_USERDATA) && > - !(busyp->flags & XFS_EXTENT_BUSY_DISCARDED)) { > - if (!xfs_extent_busy_update_extent(args->mp, args->pag, > - busyp, fbno, flen, > - false)) > - goto restart; > - continue; > - } > - > if (bbno <= fbno) { > /* start overlap */
On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote: > On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote: > > Freed extents are marked busy from the point the freeing transaction > > commits until the associated CIL context is checkpointed to the log. > > This prevents reuse and overwrite of recently freed blocks before > > the changes are committed to disk, which can lead to corruption > > after a crash. The exception to this rule is that metadata > > allocation is allowed to reuse busy extents because metadata changes > > are also logged. > > > > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS > > has allowed modification or complete invalidation of outstanding > > busy extents for metadata allocations. This implementation assumes > > that use of the associated extent is imminent, which is not always > > the case. For example, the trimmed extent might not satisfy the > > minimum length of the allocation request, or the allocation > > algorithm might be involved in a search for the optimal result based > > on locality. > > > > generic/019 reproduces a corruption caused by this scenario. First, > > a metadata block (usually a bmbt or symlink block) is freed from an > > inode. A subsequent bmbt split on an unrelated inode attempts a near > > mode allocation request that invalidates the busy block during the > > search, but does not ultimately allocate it. Due to the busy state > > invalidation, the block is no longer considered busy to subsequent > > allocation. A direct I/O write request immediately allocates the > > block and writes to it. > > I really hope there's a fstest case coming for this... :) > generic/019? :) I'm not sure of a good way to reproduce on demand given the conditions required to reproduce. > > Finally, the filesystem crashes while in a > > state where the initial metadata block free had not committed to the > > on-disk log. After recovery, the original metadata block is in its > > original location as expected, but has been corrupted by the > > aforementioned dio. > > Wheee! > > Looking at xfs_alloc_ag_vextent_exact, I guess the allocator will go > find a freespace record, call xfs_extent_busy_trim (which could erase > the busy extent entry), decide that it's not interested after all, and > bail out without restoring the busy entry. > > Similarly, xfs_alloc_cur_check calls _busy_trim (same side effects) as > we wander around the free space btrees looking for a good chunk of > space... and doesn't restore the busy record if it decides to consider a > different extent. > Yep. I was originally curious whether the more recent allocator rework introduced this problem somehow, but AFAICT that just refactored the relevant allocator code and this bug has been latent in the existing code for quite some time. That's not hugely surprising given the rare combination of conditions required to reproduce. > So I guess this "speculatively remove busy records and forget to restore > them" behavior opens the door to the write allocating blocks that aren't > yet free and nonbusy, right? And the solution presented here is to > avoid letting go of the busy record for the bmbt allocation, and if the > btree split caller decides it really /must/ have that block for the bmbt > it can force the log and try again, just like we do for a file data > allocation? > Yes, pretty much. The metadata allocation that is allowed to safely reuse busy extents ends up invalidating a set of blocks during a NEAR mode search (i.e. bmbt allocation), but ends up only using one of those blocks. A data allocation immediately comes along next, finds one of the other invalidated blocks and writes to it. A crash/recovery leaves the invalidated busy block in its original metadata location having already been written to by the dio. > Another solution could have been to restore the record if we decide not > to go ahead with the allocation, but as we haven't yet committed to > using the space, there's no sense in thrashing the busy records? > That was my original thought as well. Then after looking through the code a bit I thought that something like allowing the allocator to "track" a reusable, but still busy extent until allocation is imminent might be a bit more straightforward of an implementation given the layering between the allocator and busy extent tracking code. IOW, we'd split the busy trim/available and busy invalidate logic into two steps instead of doing it immediately in the busy trim path. That would allow the allocator to consider the same set of reusable busy blocks but not commit to any of them until the allocation search is complete. However, either of those options require a bit of thought and rework (and perhaps some value proposition justification for the complexity) while the current trim reuse code is pretty much bolted on and broken. Therefore, I think it's appropriate to fix the bug in one step and follow up with a different implementation separately. > Assuming I got all that right, > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > Thanks. Brian > --D > > > > This demonstrates that it is fundamentally unsafe to modify busy > > extent state for extents that are not guaranteed to be allocated. > > This applies to pretty much all of the code paths that currently > > trim busy extents for one reason or another. Therefore to address > > this problem, drop the reuse mechanism from the busy extent trim > > path. This code already knows how to return partial non-busy ranges > > of the targeted free extent and higher level code tracks the busy > > state of the allocation attempt. If a block allocation fails where > > one or more candidate extents is busy, we force the log and retry > > the allocation. > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > --- > > fs/xfs/xfs_extent_busy.c | 14 -------------- > > 1 file changed, 14 deletions(-) > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c > > index 3991e59cfd18..ef17c1f6db32 100644 > > --- a/fs/xfs/xfs_extent_busy.c > > +++ b/fs/xfs/xfs_extent_busy.c > > @@ -344,7 +344,6 @@ xfs_extent_busy_trim( > > ASSERT(*len > 0); > > > > spin_lock(&args->pag->pagb_lock); > > -restart: > > fbno = *bno; > > flen = *len; > > rbp = args->pag->pagb_tree.rb_node; > > @@ -363,19 +362,6 @@ xfs_extent_busy_trim( > > continue; > > } > > > > - /* > > - * If this is a metadata allocation, try to reuse the busy > > - * extent instead of trimming the allocation. > > - */ > > - if (!(args->datatype & XFS_ALLOC_USERDATA) && > > - !(busyp->flags & XFS_EXTENT_BUSY_DISCARDED)) { > > - if (!xfs_extent_busy_update_extent(args->mp, args->pag, > > - busyp, fbno, flen, > > - false)) > > - goto restart; > > - continue; > > - } > > - > > if (bbno <= fbno) { > > /* start overlap */ > > > > -- > > 2.26.2 > > >
On Tue, Feb 23, 2021 at 07:31:06AM -0500, Brian Foster wrote: > On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote: > > On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote: > > > Freed extents are marked busy from the point the freeing transaction > > > commits until the associated CIL context is checkpointed to the log. > > > This prevents reuse and overwrite of recently freed blocks before > > > the changes are committed to disk, which can lead to corruption > > > after a crash. The exception to this rule is that metadata > > > allocation is allowed to reuse busy extents because metadata changes > > > are also logged. > > > > > > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS > > > has allowed modification or complete invalidation of outstanding > > > busy extents for metadata allocations. This implementation assumes > > > that use of the associated extent is imminent, which is not always > > > the case. For example, the trimmed extent might not satisfy the > > > minimum length of the allocation request, or the allocation > > > algorithm might be involved in a search for the optimal result based > > > on locality. > > > > > > generic/019 reproduces a corruption caused by this scenario. First, > > > a metadata block (usually a bmbt or symlink block) is freed from an > > > inode. A subsequent bmbt split on an unrelated inode attempts a near > > > mode allocation request that invalidates the busy block during the > > > search, but does not ultimately allocate it. Due to the busy state > > > invalidation, the block is no longer considered busy to subsequent > > > allocation. A direct I/O write request immediately allocates the > > > block and writes to it. > > > > I really hope there's a fstest case coming for this... :) > > > > generic/019? :) I'm not sure of a good way to reproduce on demand given > the conditions required to reproduce. <nod> I guess you'd have to have a fs where extents take a long time to exit the busy tree, and then set up the allocations and frees just right. FWIW I've never hit this in generic/019. > > > Finally, the filesystem crashes while in a > > > state where the initial metadata block free had not committed to the > > > on-disk log. After recovery, the original metadata block is in its > > > original location as expected, but has been corrupted by the > > > aforementioned dio. > > > > Wheee! > > > > Looking at xfs_alloc_ag_vextent_exact, I guess the allocator will go > > find a freespace record, call xfs_extent_busy_trim (which could erase > > the busy extent entry), decide that it's not interested after all, and > > bail out without restoring the busy entry. > > > > Similarly, xfs_alloc_cur_check calls _busy_trim (same side effects) as > > we wander around the free space btrees looking for a good chunk of > > space... and doesn't restore the busy record if it decides to consider a > > different extent. > > > > Yep. I was originally curious whether the more recent allocator rework > introduced this problem somehow, but AFAICT that just refactored the > relevant allocator code and this bug has been latent in the existing > code for quite some time. That's not hugely surprising given the rare > combination of conditions required to reproduce. > > > So I guess this "speculatively remove busy records and forget to restore > > them" behavior opens the door to the write allocating blocks that aren't > > yet free and nonbusy, right? And the solution presented here is to > > avoid letting go of the busy record for the bmbt allocation, and if the > > btree split caller decides it really /must/ have that block for the bmbt > > it can force the log and try again, just like we do for a file data > > allocation? > > > > Yes, pretty much. The metadata allocation that is allowed to safely > reuse busy extents ends up invalidating a set of blocks during a NEAR > mode search (i.e. bmbt allocation), but ends up only using one of those > blocks. A data allocation immediately comes along next, finds one of the > other invalidated blocks and writes to it. A crash/recovery leaves the > invalidated busy block in its original metadata location having already > been written to by the dio. > > > Another solution could have been to restore the record if we decide not > > to go ahead with the allocation, but as we haven't yet committed to > > using the space, there's no sense in thrashing the busy records? > > > > That was my original thought as well. Then after looking through the > code a bit I thought that something like allowing the allocator to > "track" a reusable, but still busy extent until allocation is imminent > might be a bit more straightforward of an implementation given the > layering between the allocator and busy extent tracking code. IOW, we'd > split the busy trim/available and busy invalidate logic into two steps > instead of doing it immediately in the busy trim path. That would allow > the allocator to consider the same set of reusable busy blocks but not > commit to any of them until the allocation search is complete. > > However, either of those options require a bit of thought and rework > (and perhaps some value proposition justification for the complexity) > while the current trim reuse code is pretty much bolted on and broken. > Therefore, I think it's appropriate to fix the bug in one step and > follow up with a different implementation separately. <nod> I'm not sure that's even worth the effort... :) --D > > Assuming I got all that right, > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > > Thanks. > > Brian > > > --D > > > > > > > This demonstrates that it is fundamentally unsafe to modify busy > > > extent state for extents that are not guaranteed to be allocated. > > > This applies to pretty much all of the code paths that currently > > > trim busy extents for one reason or another. Therefore to address > > > this problem, drop the reuse mechanism from the busy extent trim > > > path. This code already knows how to return partial non-busy ranges > > > of the targeted free extent and higher level code tracks the busy > > > state of the allocation attempt. If a block allocation fails where > > > one or more candidate extents is busy, we force the log and retry > > > the allocation. > > > > > > Signed-off-by: Brian Foster <bfoster@redhat.com> > > > --- > > > fs/xfs/xfs_extent_busy.c | 14 -------------- > > > 1 file changed, 14 deletions(-) > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c > > > index 3991e59cfd18..ef17c1f6db32 100644 > > > --- a/fs/xfs/xfs_extent_busy.c > > > +++ b/fs/xfs/xfs_extent_busy.c > > > @@ -344,7 +344,6 @@ xfs_extent_busy_trim( > > > ASSERT(*len > 0); > > > > > > spin_lock(&args->pag->pagb_lock); > > > -restart: > > > fbno = *bno; > > > flen = *len; > > > rbp = args->pag->pagb_tree.rb_node; > > > @@ -363,19 +362,6 @@ xfs_extent_busy_trim( > > > continue; > > > } > > > > > > - /* > > > - * If this is a metadata allocation, try to reuse the busy > > > - * extent instead of trimming the allocation. > > > - */ > > > - if (!(args->datatype & XFS_ALLOC_USERDATA) && > > > - !(busyp->flags & XFS_EXTENT_BUSY_DISCARDED)) { > > > - if (!xfs_extent_busy_update_extent(args->mp, args->pag, > > > - busyp, fbno, flen, > > > - false)) > > > - goto restart; > > > - continue; > > > - } > > > - > > > if (bbno <= fbno) { > > > /* start overlap */ > > > > > > -- > > > 2.26.2 > > > > > >
As a quick fix this looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
That beeing said we really need to go back and look into this,
especially due to discards. For SSDs it is generlly much better to
quickly reuse freed blocks rather than discarding them later.
On Thu, Feb 25, 2021 at 07:51:05AM +0000, Christoph Hellwig wrote: > As a quick fix this looks good: > > Reviewed-by: Christoph Hellwig <hch@lst.de> > > That beeing said we really need to go back and look into this, > especially due to discards. For SSDs it is generlly much better to > quickly reuse freed blocks rather than discarding them later. > Ok, that's an interesting point. I'm not sure online discard is a super critical use case, but I agree that there's a tangible advantage to optimizing out pending discards in that configuration. That also raises a caveat with the alternative implementation I was mulling over. The current implementation simply skips over extents that are busy and already under discard. If we were to find a prospective reusable busy block (not under discard), allocate it, and then commit to reusing it, we'd have to deal with the fact that we could find it under discard at that point. We can't easily skip it because we've already performed an allocation in the transaction by that point. I suspect the simplest solution is just wait for the discard to complete since we already have somewhat of a mechanism to do that, but hopefully it wouldn't be a frequent occurence. Brian
On Tue, Feb 23, 2021 at 2:35 PM Brian Foster <bfoster@redhat.com> wrote: > > On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote: > > On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote: > > > Freed extents are marked busy from the point the freeing transaction > > > commits until the associated CIL context is checkpointed to the log. > > > This prevents reuse and overwrite of recently freed blocks before > > > the changes are committed to disk, which can lead to corruption > > > after a crash. The exception to this rule is that metadata > > > allocation is allowed to reuse busy extents because metadata changes > > > are also logged. > > > > > > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS > > > has allowed modification or complete invalidation of outstanding > > > busy extents for metadata allocations. This implementation assumes > > > that use of the associated extent is imminent, which is not always > > > the case. For example, the trimmed extent might not satisfy the > > > minimum length of the allocation request, or the allocation > > > algorithm might be involved in a search for the optimal result based > > > on locality. > > > > > > generic/019 reproduces a corruption caused by this scenario. First, > > > a metadata block (usually a bmbt or symlink block) is freed from an > > > inode. A subsequent bmbt split on an unrelated inode attempts a near > > > mode allocation request that invalidates the busy block during the > > > search, but does not ultimately allocate it. Due to the busy state > > > invalidation, the block is no longer considered busy to subsequent > > > allocation. A direct I/O write request immediately allocates the > > > block and writes to it. > > > > I really hope there's a fstest case coming for this... :) > > > > generic/019? :) I'm not sure of a good way to reproduce on demand given > the conditions required to reproduce. > > > > Finally, the filesystem crashes while in a > > > state where the initial metadata block free had not committed to the > > > on-disk log. After recovery, the original metadata block is in its > > > original location as expected, but has been corrupted by the > > > aforementioned dio. > > > > Wheee! > > > > Looking at xfs_alloc_ag_vextent_exact, I guess the allocator will go > > find a freespace record, call xfs_extent_busy_trim (which could erase > > the busy extent entry), decide that it's not interested after all, and > > bail out without restoring the busy entry. > > > > Similarly, xfs_alloc_cur_check calls _busy_trim (same side effects) as > > we wander around the free space btrees looking for a good chunk of > > space... and doesn't restore the busy record if it decides to consider a > > different extent. > > > > Yep. I was originally curious whether the more recent allocator rework > introduced this problem somehow, but AFAICT that just refactored the > relevant allocator code and this bug has been latent in the existing > code for quite some time. That's not hugely surprising given the rare > combination of conditions required to reproduce. > > > So I guess this "speculatively remove busy records and forget to restore > > them" behavior opens the door to the write allocating blocks that aren't > > yet free and nonbusy, right? And the solution presented here is to > > avoid letting go of the busy record for the bmbt allocation, and if the > > btree split caller decides it really /must/ have that block for the bmbt > > it can force the log and try again, just like we do for a file data > > allocation? > > > > Yes, pretty much. The metadata allocation that is allowed to safely > reuse busy extents ends up invalidating a set of blocks during a NEAR > mode search (i.e. bmbt allocation), but ends up only using one of those > blocks. A data allocation immediately comes along next, finds one of the > other invalidated blocks and writes to it. A crash/recovery leaves the > invalidated busy block in its original metadata location having already > been written to by the dio. > > > Another solution could have been to restore the record if we decide not > > to go ahead with the allocation, but as we haven't yet committed to > > using the space, there's no sense in thrashing the busy records? > > > > That was my original thought as well. Then after looking through the > code a bit I thought that something like allowing the allocator to > "track" a reusable, but still busy extent until allocation is imminent > might be a bit more straightforward of an implementation given the > layering between the allocator and busy extent tracking code. IOW, we'd > split the busy trim/available and busy invalidate logic into two steps > instead of doing it immediately in the busy trim path. That would allow > the allocator to consider the same set of reusable busy blocks but not > commit to any of them until the allocation search is complete. > > However, either of those options require a bit of thought and rework > (and perhaps some value proposition justification for the complexity) > while the current trim reuse code is pretty much bolted on and broken. > Therefore, I think it's appropriate to fix the bug in one step and > follow up with a different implementation separately. > Hi Brian, This patch was one of my selected fixes to backport for v5.10.y. It has a very scary looking commit message and the change seems to be independent of any infrastructure changes(?). The problem is that applying this patch to v5.10.y reliably reproduces this buffer corruption assertion [*] with test xfs/076. This happens on the kdevops system that is using loop devices over sparse files inside qemu images. It does not reproduce on my small VM at home. Normally, I would just drop this patch from the stable candidates queue and move on, but I thought you might be interested to investigate this reliable reproducer, because maybe this system exercises an error that is otherwise rare to hit. It seemed weird to me that NOT reusing the extent would result in data corruption, but it could indicate that reusing the extent was masking the assertion and hiding another bug(?). Can you think of another reason to explain the regression this fix introduces to 5.10.y? Do you care to investigate this failure or shall I just move on? Thanks, Amir. [*] : XFS (loop5): Internal error xfs_trans_cancel at line 954 of file fs/xfs/xfs_trans.c. Caller xfs_create+0x22f/0x590 [xfs] : CPU: 3 PID: 25481 Comm: touch Kdump: loaded Tainted: G E 5.10.109-xfs-2 #8 : Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 : Call Trace: : dump_stack+0x6d/0x88 : xfs_trans_cancel+0x17b/0x1a0 [xfs] : xfs_create+0x22f/0x590 [xfs] : xfs_generic_create+0x245/0x310 [xfs] : ? d_splice_alias+0x13a/0x3c0 : path_openat+0xe3f/0x1080 : do_filp_open+0x93/0x100 : ? handle_mm_fault+0x148e/0x1690 : ? __check_object_size+0x162/0x180 : do_sys_openat2+0x228/0x2d0 : do_sys_open+0x4b/0x80 : do_syscall_64+0x33/0x80 : entry_SYSCALL_64_after_hwframe+0x44/0xa9 : RIP: 0033:0x7f36b02eff1e : Code: 25 00 00 41 00 3d 00 00 41 00 74 48 48 8d 05 e9 57 0d 00 8b 00 85 c0 75 69 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 0 : RSP: 002b:00007ffe7ef6ca10 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 : RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f36b02eff1e : RDX: 0000000000000941 RSI: 00007ffe7ef6ebfa RDI: 00000000ffffff9c : RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 : R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000002 : R13: 00007ffe7ef6ebfa R14: 0000000000000001 R15: 0000000000000001 : XFS (loop5): xfs_do_force_shutdown(0x8) called from line 955 of file fs/xfs/xfs_trans.c. Return address = ffffffffc08f5764 : XFS (loop5): Corruption of in-memory data detected. Shutting down filesystem : XFS (loop5): Please unmount the filesystem and rectify the problem(s)
On Thu, May 26, 2022 at 02:34:38PM +0300, Amir Goldstein wrote: > On Tue, Feb 23, 2021 at 2:35 PM Brian Foster <bfoster@redhat.com> wrote: > > > > On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote: > > > On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote: > > > > Freed extents are marked busy from the point the freeing transaction > > > > commits until the associated CIL context is checkpointed to the log. > > > > This prevents reuse and overwrite of recently freed blocks before > > > > the changes are committed to disk, which can lead to corruption > > > > after a crash. The exception to this rule is that metadata > > > > allocation is allowed to reuse busy extents because metadata changes > > > > are also logged. > > > > > > > > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS > > > > has allowed modification or complete invalidation of outstanding > > > > busy extents for metadata allocations. This implementation assumes > > > > that use of the associated extent is imminent, which is not always > > > > the case. For example, the trimmed extent might not satisfy the > > > > minimum length of the allocation request, or the allocation > > > > algorithm might be involved in a search for the optimal result based > > > > on locality. > > > > > > > > generic/019 reproduces a corruption caused by this scenario. First, > > > > a metadata block (usually a bmbt or symlink block) is freed from an > > > > inode. A subsequent bmbt split on an unrelated inode attempts a near > > > > mode allocation request that invalidates the busy block during the > > > > search, but does not ultimately allocate it. Due to the busy state > > > > invalidation, the block is no longer considered busy to subsequent > > > > allocation. A direct I/O write request immediately allocates the > > > > block and writes to it. > > > > > > I really hope there's a fstest case coming for this... :) > > > > > > > generic/019? :) I'm not sure of a good way to reproduce on demand given > > the conditions required to reproduce. > > > > > > Finally, the filesystem crashes while in a > > > > state where the initial metadata block free had not committed to the > > > > on-disk log. After recovery, the original metadata block is in its > > > > original location as expected, but has been corrupted by the > > > > aforementioned dio. > > > > > > Wheee! > > > > > > Looking at xfs_alloc_ag_vextent_exact, I guess the allocator will go > > > find a freespace record, call xfs_extent_busy_trim (which could erase > > > the busy extent entry), decide that it's not interested after all, and > > > bail out without restoring the busy entry. > > > > > > Similarly, xfs_alloc_cur_check calls _busy_trim (same side effects) as > > > we wander around the free space btrees looking for a good chunk of > > > space... and doesn't restore the busy record if it decides to consider a > > > different extent. > > > > > > > Yep. I was originally curious whether the more recent allocator rework > > introduced this problem somehow, but AFAICT that just refactored the > > relevant allocator code and this bug has been latent in the existing > > code for quite some time. That's not hugely surprising given the rare > > combination of conditions required to reproduce. > > > > > So I guess this "speculatively remove busy records and forget to restore > > > them" behavior opens the door to the write allocating blocks that aren't > > > yet free and nonbusy, right? And the solution presented here is to > > > avoid letting go of the busy record for the bmbt allocation, and if the > > > btree split caller decides it really /must/ have that block for the bmbt > > > it can force the log and try again, just like we do for a file data > > > allocation? > > > > > > > Yes, pretty much. The metadata allocation that is allowed to safely > > reuse busy extents ends up invalidating a set of blocks during a NEAR > > mode search (i.e. bmbt allocation), but ends up only using one of those > > blocks. A data allocation immediately comes along next, finds one of the > > other invalidated blocks and writes to it. A crash/recovery leaves the > > invalidated busy block in its original metadata location having already > > been written to by the dio. > > > > > Another solution could have been to restore the record if we decide not > > > to go ahead with the allocation, but as we haven't yet committed to > > > using the space, there's no sense in thrashing the busy records? > > > > > > > That was my original thought as well. Then after looking through the > > code a bit I thought that something like allowing the allocator to > > "track" a reusable, but still busy extent until allocation is imminent > > might be a bit more straightforward of an implementation given the > > layering between the allocator and busy extent tracking code. IOW, we'd > > split the busy trim/available and busy invalidate logic into two steps > > instead of doing it immediately in the busy trim path. That would allow > > the allocator to consider the same set of reusable busy blocks but not > > commit to any of them until the allocation search is complete. > > > > However, either of those options require a bit of thought and rework > > (and perhaps some value proposition justification for the complexity) > > while the current trim reuse code is pretty much bolted on and broken. > > Therefore, I think it's appropriate to fix the bug in one step and > > follow up with a different implementation separately. > > > > Hi Brian, > > This patch was one of my selected fixes to backport for v5.10.y. > It has a very scary looking commit message and the change seems > to be independent of any infrastructure changes(?). > > The problem is that applying this patch to v5.10.y reliably reproduces > this buffer corruption assertion [*] with test xfs/076. > > This happens on the kdevops system that is using loop devices over > sparse files inside qemu images. It does not reproduce on my small > VM at home. > > Normally, I would just drop this patch from the stable candidates queue > and move on, but I thought you might be interested to investigate this > reliable reproducer, because maybe this system exercises an error > that is otherwise rare to hit. > > It seemed weird to me that NOT reusing the extent would result in > data corruption, but it could indicate that reusing the extent was masking > the assertion and hiding another bug(?). > Indeed, this does seem like an odd failure. The shutdown on transaction cancel implies cancellation of a dirty transaction. This is not necessarily corruption as much as just being the generic naming/messaging related to shutdowns due to unexpected in-core state. The patch in question removes some modifications to in-core busy extent state during extent allocation that are fundamentally unsafe in combination with how allocation works. This change doesn't appear to affect any transaction directly, so the correlation may be indirect. xfs/076 looks like it's a sparse inode allocation test, which certainly seems relevant in that it is stressing the ability to allocate inode chunks under free space fragmentation. If this patch further restricts extent allocation by removing availability of some set of (recently freed, busy) extents, then perhaps there is some allocation failure sequence that was previously unlikely enough to mask some poor error handling logic or transaction handling (like an agfl fixup dirtying a transaction followed by an allocation failure, for example) that we're now running into. > Can you think of another reason to explain the regression this fix > introduces to 5.10.y? > Not off the top of my head. Something along the lines of the above seems plausible, but that's just speculation at this point. > Do you care to investigate this failure or shall I just move on? > I think it would be good to understand whether there's a regression introduced by this patch, a bug somewhere else or just some impedence mismatch in logic between the combination of this change and whatever else happens to be in v5.10.y. Unfortunately I'm not able to reproduce if I pull just this commit back into latest 5.10.y (5.10.118). I've tried with a traditional bdev as well as a preallocated and sparse loopback scratch dev. Have you tested this patch (backport) in isolation in your reproducer env or only in combination with other pending backports? Brian > Thanks, > Amir. > > [*] > : XFS (loop5): Internal error xfs_trans_cancel at line 954 of file > fs/xfs/xfs_trans.c. Caller xfs_create+0x22f/0x590 [xfs] > : CPU: 3 PID: 25481 Comm: touch Kdump: loaded Tainted: G E > 5.10.109-xfs-2 #8 > : Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 > 04/01/2014 > : Call Trace: > : dump_stack+0x6d/0x88 > : xfs_trans_cancel+0x17b/0x1a0 [xfs] > : xfs_create+0x22f/0x590 [xfs] > : xfs_generic_create+0x245/0x310 [xfs] > : ? d_splice_alias+0x13a/0x3c0 > : path_openat+0xe3f/0x1080 > : do_filp_open+0x93/0x100 > : ? handle_mm_fault+0x148e/0x1690 > : ? __check_object_size+0x162/0x180 > : do_sys_openat2+0x228/0x2d0 > : do_sys_open+0x4b/0x80 > : do_syscall_64+0x33/0x80 > : entry_SYSCALL_64_after_hwframe+0x44/0xa9 > : RIP: 0033:0x7f36b02eff1e > : Code: 25 00 00 41 00 3d 00 00 41 00 74 48 48 8d 05 e9 57 0d 00 8b 00 > 85 c0 75 69 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d > 00 f0 ff ff 0 > : RSP: 002b:00007ffe7ef6ca10 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 > : RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f36b02eff1e > : RDX: 0000000000000941 RSI: 00007ffe7ef6ebfa RDI: 00000000ffffff9c > : RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 > : R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000002 > : R13: 00007ffe7ef6ebfa R14: 0000000000000001 R15: 0000000000000001 > : XFS (loop5): xfs_do_force_shutdown(0x8) called from line 955 of file > fs/xfs/xfs_trans.c. Return address = ffffffffc08f5764 > : XFS (loop5): Corruption of in-memory data detected. Shutting down filesystem > : XFS (loop5): Please unmount the filesystem and rectify the problem(s) >
> > Hi Brian, > > > > This patch was one of my selected fixes to backport for v5.10.y. > > It has a very scary looking commit message and the change seems > > to be independent of any infrastructure changes(?). > > > > The problem is that applying this patch to v5.10.y reliably reproduces > > this buffer corruption assertion [*] with test xfs/076. > > > > This happens on the kdevops system that is using loop devices over > > sparse files inside qemu images. It does not reproduce on my small > > VM at home. > > > > Normally, I would just drop this patch from the stable candidates queue > > and move on, but I thought you might be interested to investigate this > > reliable reproducer, because maybe this system exercises an error > > that is otherwise rare to hit. > > > > It seemed weird to me that NOT reusing the extent would result in > > data corruption, but it could indicate that reusing the extent was masking > > the assertion and hiding another bug(?). > > > > Indeed, this does seem like an odd failure. The shutdown on transaction > cancel implies cancellation of a dirty transaction. This is not > necessarily corruption as much as just being the generic > naming/messaging related to shutdowns due to unexpected in-core state. > The patch in question removes some modifications to in-core busy extent > state during extent allocation that are fundamentally unsafe in > combination with how allocation works. This change doesn't appear to > affect any transaction directly, so the correlation may be indirect. > > xfs/076 looks like it's a sparse inode allocation test, which certainly > seems relevant in that it is stressing the ability to allocate inode > chunks under free space fragmentation. If this patch further restricts > extent allocation by removing availability of some set of (recently > freed, busy) extents, then perhaps there is some allocation failure > sequence that was previously unlikely enough to mask some poor error > handling logic or transaction handling (like an agfl fixup dirtying a > transaction followed by an allocation failure, for example) that we're > now running into. > > > Can you think of another reason to explain the regression this fix > > introduces to 5.10.y? > > > > Not off the top of my head. Something along the lines of the above seems > plausible, but that's just speculation at this point. > > > Do you care to investigate this failure or shall I just move on? > > > > I think it would be good to understand whether there's a regression > introduced by this patch, a bug somewhere else or just some impedence > mismatch in logic between the combination of this change and whatever > else happens to be in v5.10.y. Unfortunately I'm not able to reproduce > if I pull just this commit back into latest 5.10.y (5.10.118). I've > tried with a traditional bdev as well as a preallocated and sparse > loopback scratch dev. I also failed to reproduce it on another VM, but it reproduces reliably on this system. That's why I thought we'd better use this opportunity. This system has lots of RAM and disk to spare so I have no problem running this test in a VM in parallel to my work. It is not actually my system, it's a system that Luis has setup for stable XFS testing and gave me access to, so if the need arises you could get direct access to the system, but for now, I have no problem running the test for you. > Have you tested this patch (backport) in isolation > in your reproducer env or only in combination with other pending > backports? > I tested it on top of 5.10.109 + these 5 patches: https://github.com/amir73il/linux/commits/xfs-5.10.y-1 I can test it in isolation if you like. Let me know if there are other forensics that you would like me to collect. Thanks, Amir.
On Thu, May 26, 2022 at 6:28 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > Hi Brian, > > > > > > This patch was one of my selected fixes to backport for v5.10.y. > > > It has a very scary looking commit message and the change seems > > > to be independent of any infrastructure changes(?). > > > > > > The problem is that applying this patch to v5.10.y reliably reproduces > > > this buffer corruption assertion [*] with test xfs/076. > > > > > > This happens on the kdevops system that is using loop devices over > > > sparse files inside qemu images. It does not reproduce on my small > > > VM at home. > > > > > > Normally, I would just drop this patch from the stable candidates queue > > > and move on, but I thought you might be interested to investigate this > > > reliable reproducer, because maybe this system exercises an error > > > that is otherwise rare to hit. > > > > > > It seemed weird to me that NOT reusing the extent would result in > > > data corruption, but it could indicate that reusing the extent was masking > > > the assertion and hiding another bug(?). > > > > > > > Indeed, this does seem like an odd failure. The shutdown on transaction > > cancel implies cancellation of a dirty transaction. This is not > > necessarily corruption as much as just being the generic > > naming/messaging related to shutdowns due to unexpected in-core state. > > The patch in question removes some modifications to in-core busy extent > > state during extent allocation that are fundamentally unsafe in > > combination with how allocation works. This change doesn't appear to > > affect any transaction directly, so the correlation may be indirect. > > > > xfs/076 looks like it's a sparse inode allocation test, which certainly > > seems relevant in that it is stressing the ability to allocate inode > > chunks under free space fragmentation. If this patch further restricts > > extent allocation by removing availability of some set of (recently > > freed, busy) extents, then perhaps there is some allocation failure > > sequence that was previously unlikely enough to mask some poor error > > handling logic or transaction handling (like an agfl fixup dirtying a > > transaction followed by an allocation failure, for example) that we're > > now running into. > > > > > Can you think of another reason to explain the regression this fix > > > introduces to 5.10.y? > > > > > > > Not off the top of my head. Something along the lines of the above seems > > plausible, but that's just speculation at this point. > > > > > Do you care to investigate this failure or shall I just move on? > > > > > > > I think it would be good to understand whether there's a regression > > introduced by this patch, a bug somewhere else or just some impedence > > mismatch in logic between the combination of this change and whatever > > else happens to be in v5.10.y. Unfortunately I'm not able to reproduce > > if I pull just this commit back into latest 5.10.y (5.10.118). I've > > tried with a traditional bdev as well as a preallocated and sparse > > loopback scratch dev. > > I also failed to reproduce it on another VM, but it reproduces reliably > on this system. That's why I thought we'd better use this opportunity. > This system has lots of RAM and disk to spare so I have no problem > running this test in a VM in parallel to my work. > > It is not actually my system, it's a system that Luis has setup for > stable XFS testing and gave me access to, so if the need arises > you could get direct access to the system, but for now, I have no > problem running the test for you. > > > Have you tested this patch (backport) in isolation > > in your reproducer env or only in combination with other pending > > backports? > > > > I tested it on top of 5.10.109 + these 5 patches: > https://github.com/amir73il/linux/commits/xfs-5.10.y-1 > > I can test it in isolation if you like. Let me know if there are > other forensics that you would like me to collect. >. FWIW, the test fails with just that one patch over 5.10.109. Thanks, Amir.
On Thu, May 26, 2022 at 06:28:23PM +0300, Amir Goldstein wrote: > > > Hi Brian, > > > > > > This patch was one of my selected fixes to backport for v5.10.y. > > > It has a very scary looking commit message and the change seems > > > to be independent of any infrastructure changes(?). > > > > > > The problem is that applying this patch to v5.10.y reliably reproduces > > > this buffer corruption assertion [*] with test xfs/076. > > > > > > This happens on the kdevops system that is using loop devices over > > > sparse files inside qemu images. It does not reproduce on my small > > > VM at home. > > > > > > Normally, I would just drop this patch from the stable candidates queue > > > and move on, but I thought you might be interested to investigate this > > > reliable reproducer, because maybe this system exercises an error > > > that is otherwise rare to hit. > > > > > > It seemed weird to me that NOT reusing the extent would result in > > > data corruption, but it could indicate that reusing the extent was masking > > > the assertion and hiding another bug(?). > > > > > > > Indeed, this does seem like an odd failure. The shutdown on transaction > > cancel implies cancellation of a dirty transaction. This is not > > necessarily corruption as much as just being the generic > > naming/messaging related to shutdowns due to unexpected in-core state. > > The patch in question removes some modifications to in-core busy extent > > state during extent allocation that are fundamentally unsafe in > > combination with how allocation works. This change doesn't appear to > > affect any transaction directly, so the correlation may be indirect. > > > > xfs/076 looks like it's a sparse inode allocation test, which certainly > > seems relevant in that it is stressing the ability to allocate inode > > chunks under free space fragmentation. If this patch further restricts > > extent allocation by removing availability of some set of (recently > > freed, busy) extents, then perhaps there is some allocation failure > > sequence that was previously unlikely enough to mask some poor error > > handling logic or transaction handling (like an agfl fixup dirtying a > > transaction followed by an allocation failure, for example) that we're > > now running into. > > > > > Can you think of another reason to explain the regression this fix > > > introduces to 5.10.y? > > > > > > > Not off the top of my head. Something along the lines of the above seems > > plausible, but that's just speculation at this point. > > > > > Do you care to investigate this failure or shall I just move on? > > > > > > > I think it would be good to understand whether there's a regression > > introduced by this patch, a bug somewhere else or just some impedence > > mismatch in logic between the combination of this change and whatever > > else happens to be in v5.10.y. Unfortunately I'm not able to reproduce > > if I pull just this commit back into latest 5.10.y (5.10.118). I've > > tried with a traditional bdev as well as a preallocated and sparse > > loopback scratch dev. > > I also failed to reproduce it on another VM, but it reproduces reliably > on this system. That's why I thought we'd better use this opportunity. > This system has lots of RAM and disk to spare so I have no problem > running this test in a VM in parallel to my work. > > It is not actually my system, it's a system that Luis has setup for > stable XFS testing and gave me access to, so if the need arises > you could get direct access to the system, but for now, I have no > problem running the test for you. > > > Have you tested this patch (backport) in isolation > > in your reproducer env or only in combination with other pending > > backports? > > > > I tested it on top of 5.10.109 + these 5 patches: > https://github.com/amir73il/linux/commits/xfs-5.10.y-1 > > I can test it in isolation if you like. Let me know if there are > other forensics that you would like me to collect. > Hm. Still no luck if I move to .109 and pull in those few patches. I assume there's nothing else potentially interesting about the test env other than the sparse file scratch dev (i.e., default mkfs options, etc.)? If so and you can reliably reproduce, I suppose it couldn't hurt to try and grab a tracepoint dump of the test when it fails (feel free to send directly or upload somewhere as the list may punt it, and please also include the dmesg output that goes along with it) and I can see if that shows anything helpful. I think what we want to know initially is what error code we're producing (-ENOSPC?) and where it originates, and from there we can probably work out how the transaction might be dirty. I'm not sure a trace dump will express that conclusively. If you wanted to increase the odds of getting some useful information it might be helpful to stick a few trace_printk() calls in the various trans cancel error paths out of xfs_create() to determine whether it's the inode allocation attempt that fails or the subsequent attempt to create the directory entry.. Brian > Thanks, > Amir. >
> > I tested it on top of 5.10.109 + these 5 patches: > > https://github.com/amir73il/linux/commits/xfs-5.10.y-1 > > > > I can test it in isolation if you like. Let me know if there are > > other forensics that you would like me to collect. > > > > Hm. Still no luck if I move to .109 and pull in those few patches. I > assume there's nothing else potentially interesting about the test env > other than the sparse file scratch dev (i.e., default mkfs options, Oh! right, this guest is debian/10 with xfsprogs 4.20, so the defaults are reflink=0. Actually, the section I am running is reflink_normapbt, but... ** mkfs failed with extra mkfs options added to "-f -m reflink=1,rmapbt=0, -i sparse=1," by test 076 ** ** attempting to mkfs using only test 076 options: -m crc=1 -i sparse ** ** mkfs failed with extra mkfs options added to "-f -m reflink=1,rmapbt=0, -i sparse=1," by test 076 ** ** attempting to mkfs using only test 076 options: -d size=50m -m crc=1 -i sparse ** mkfs.xfs does not accept double sparse argument, so the test falls back to mkfs defaults (+ sparse) I checked and xfsprogs 5.3 behaves the same, I did not check newer xfsprogs, but that seems like a test bug(?) IWO, unless xfsprogs was changed to be more tolerable to repeating arguments, then maybe nobody is testing xfs/076 with reflink=0 (?) > etc.)? If so and you can reliably reproduce, I suppose it couldn't hurt > to try and grab a tracepoint dump of the test when it fails (feel free > to send directly or upload somewhere as the list may punt it, and please > also include the dmesg output that goes along with it) and I can see if > that shows anything helpful. > > I think what we want to know initially is what error code we're > producing (-ENOSPC?) and where it originates, and from there we can > probably work out how the transaction might be dirty. I'm not sure a > trace dump will express that conclusively. If you wanted to increase the > odds of getting some useful information it might be helpful to stick a > few trace_printk() calls in the various trans cancel error paths out of > xfs_create() to determine whether it's the inode allocation attempt that > fails or the subsequent attempt to create the directory entry.. > Well, the full output is filled with ENOSPC (also in a good run), so it's probably that, but I will try to get to that failing stack, no need for all the noisy traces. Signing off the day. hope I will get to it tomorrow. Thanks, Amir. P.S: this is how 076.full ends if it makes any difference: touch: cannot touch '/media/scratch/offset.21889024/63': No space left on device touch: cannot touch '/media/scratch/offset.21823488/63': No space left on device touch: cannot touch '/media/scratch/offset.21757952/63': No space left on device touch: cannot touch '/media/scratch/offset.21692416/63': No space left on device touch: cannot touch '/media/scratch/offset.21626880/63': No space left on device touch: cannot touch '/media/scratch/offset.21561344/63': No space left on device touch: cannot touch '/media/scratch/offset.21495808/63': No space left on device touch: cannot touch '/media/scratch/offset.21430272/63': No space left on device stat: Input/output error fpunch failed
On Thu, May 26, 2022 at 11:56 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > I tested it on top of 5.10.109 + these 5 patches: > > > https://github.com/amir73il/linux/commits/xfs-5.10.y-1 > > > > > > I can test it in isolation if you like. Let me know if there are > > > other forensics that you would like me to collect. > > > > > > > Hm. Still no luck if I move to .109 and pull in those few patches. I > > assume there's nothing else potentially interesting about the test env > > other than the sparse file scratch dev (i.e., default mkfs options, > > Oh! right, this guest is debian/10 with xfsprogs 4.20, so the defaults > are reflink=0. > > Actually, the section I am running is reflink_normapbt, but... > > ** mkfs failed with extra mkfs options added to "-f -m > reflink=1,rmapbt=0, -i sparse=1," by test 076 ** > ** attempting to mkfs using only test 076 options: -m crc=1 -i sparse ** > ** mkfs failed with extra mkfs options added to "-f -m > reflink=1,rmapbt=0, -i sparse=1," by test 076 ** > ** attempting to mkfs using only test 076 options: -d size=50m -m > crc=1 -i sparse ** > > mkfs.xfs does not accept double sparse argument, so the > test falls back to mkfs defaults (+ sparse) > > I checked and xfsprogs 5.3 behaves the same, I did not check newer > xfsprogs, but that seems like a test bug(?) > xfsprogs 5.16 still behaves the same, meaning that xfs/076 and many many other tests ignore the custom mkfs options for the specific sections. That is a big test coverage issue! > IWO, unless xfsprogs was changed to be more tolerable to repeating > arguments, then maybe nobody is testing xfs/076 with reflink=0 (?) > Bingo! Test passes 100 runs with debian/testing - xfsprogs v5.16 I shall try to amend the test to force reflink=0 to see what happens. You should try it as well. Thanks, Amir.
On Fri, May 27, 2022 at 10:06:46AM +0300, Amir Goldstein wrote: > On Thu, May 26, 2022 at 11:56 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > I tested it on top of 5.10.109 + these 5 patches: > > > > https://github.com/amir73il/linux/commits/xfs-5.10.y-1 > > > > > > > > I can test it in isolation if you like. Let me know if there are > > > > other forensics that you would like me to collect. > > > > > > > > > > Hm. Still no luck if I move to .109 and pull in those few patches. I > > > assume there's nothing else potentially interesting about the test env > > > other than the sparse file scratch dev (i.e., default mkfs options, > > > > Oh! right, this guest is debian/10 with xfsprogs 4.20, so the defaults > > are reflink=0. > > > > Actually, the section I am running is reflink_normapbt, but... > > > > ** mkfs failed with extra mkfs options added to "-f -m > > reflink=1,rmapbt=0, -i sparse=1," by test 076 ** > > ** attempting to mkfs using only test 076 options: -m crc=1 -i sparse ** > > ** mkfs failed with extra mkfs options added to "-f -m > > reflink=1,rmapbt=0, -i sparse=1," by test 076 ** > > ** attempting to mkfs using only test 076 options: -d size=50m -m > > crc=1 -i sparse ** > > > > mkfs.xfs does not accept double sparse argument, so the > > test falls back to mkfs defaults (+ sparse) > > > > I checked and xfsprogs 5.3 behaves the same, I did not check newer > > xfsprogs, but that seems like a test bug(?) > > > > xfsprogs 5.16 still behaves the same, meaning that xfs/076 and many many > other tests ignore the custom mkfs options for the specific sections. > That is a big test coverage issue! > > > IWO, unless xfsprogs was changed to be more tolerable to repeating > > arguments, then maybe nobody is testing xfs/076 with reflink=0 (?) > > > > Bingo! > Test passes 100 runs with debian/testing - xfsprogs v5.16 > > I shall try to amend the test to force reflink=0 to see what happens. > You should try it as well. > Interesting. If I set -mreflink=0 xfs/076 seems to do the right thing and format the scratch device as expected, but I'm still not seeing a failure on my system for whatever reason. Brian > Thanks, > Amir. >
On Fri, May 27, 2022 at 3:50 PM Brian Foster <bfoster@redhat.com> wrote: > > On Fri, May 27, 2022 at 10:06:46AM +0300, Amir Goldstein wrote: > > On Thu, May 26, 2022 at 11:56 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > > > I tested it on top of 5.10.109 + these 5 patches: > > > > > https://github.com/amir73il/linux/commits/xfs-5.10.y-1 > > > > > > > > > > I can test it in isolation if you like. Let me know if there are > > > > > other forensics that you would like me to collect. > > > > > > > > > > > > > Hm. Still no luck if I move to .109 and pull in those few patches. I > > > > assume there's nothing else potentially interesting about the test env > > > > other than the sparse file scratch dev (i.e., default mkfs options, > > > > > > Oh! right, this guest is debian/10 with xfsprogs 4.20, so the defaults > > > are reflink=0. > > > > > > Actually, the section I am running is reflink_normapbt, but... > > > > > > ** mkfs failed with extra mkfs options added to "-f -m > > > reflink=1,rmapbt=0, -i sparse=1," by test 076 ** > > > ** attempting to mkfs using only test 076 options: -m crc=1 -i sparse ** > > > ** mkfs failed with extra mkfs options added to "-f -m > > > reflink=1,rmapbt=0, -i sparse=1," by test 076 ** > > > ** attempting to mkfs using only test 076 options: -d size=50m -m > > > crc=1 -i sparse ** > > > > > > mkfs.xfs does not accept double sparse argument, so the > > > test falls back to mkfs defaults (+ sparse) > > > > > > I checked and xfsprogs 5.3 behaves the same, I did not check newer > > > xfsprogs, but that seems like a test bug(?) > > > > > > > xfsprogs 5.16 still behaves the same, meaning that xfs/076 and many many > > other tests ignore the custom mkfs options for the specific sections. > > That is a big test coverage issue! > > > > > IWO, unless xfsprogs was changed to be more tolerable to repeating > > > arguments, then maybe nobody is testing xfs/076 with reflink=0 (?) > > > > > > > Bingo! > > Test passes 100 runs with debian/testing - xfsprogs v5.16 > > > > I shall try to amend the test to force reflink=0 to see what happens. > > You should try it as well. > > > > Interesting. If I set -mreflink=0 xfs/076 seems to do the right thing > and format the scratch device as expected, but I'm still not seeing a > failure on my system for whatever reason. Me neither (on my home VM). I will investigate the assertion on the system where it reproduces. May take me some time but I will get to it. Thanks, Amir.
On Thu, May 26, 2022 at 10:47 PM Brian Foster <bfoster@redhat.com> wrote: > > On Thu, May 26, 2022 at 06:28:23PM +0300, Amir Goldstein wrote: > > > > Hi Brian, > > > > > > > > This patch was one of my selected fixes to backport for v5.10.y. > > > > It has a very scary looking commit message and the change seems > > > > to be independent of any infrastructure changes(?). > > > > > > > > The problem is that applying this patch to v5.10.y reliably reproduces > > > > this buffer corruption assertion [*] with test xfs/076. > > > > > > > > This happens on the kdevops system that is using loop devices over > > > > sparse files inside qemu images. It does not reproduce on my small > > > > VM at home. > > > > > > > > Normally, I would just drop this patch from the stable candidates queue > > > > and move on, but I thought you might be interested to investigate this > > > > reliable reproducer, because maybe this system exercises an error > > > > that is otherwise rare to hit. > > > > > > > > It seemed weird to me that NOT reusing the extent would result in > > > > data corruption, but it could indicate that reusing the extent was masking > > > > the assertion and hiding another bug(?). > > > > > > > > > > Indeed, this does seem like an odd failure. The shutdown on transaction > > > cancel implies cancellation of a dirty transaction. This is not > > > necessarily corruption as much as just being the generic > > > naming/messaging related to shutdowns due to unexpected in-core state. > > > The patch in question removes some modifications to in-core busy extent > > > state during extent allocation that are fundamentally unsafe in > > > combination with how allocation works. This change doesn't appear to > > > affect any transaction directly, so the correlation may be indirect. > > > > > > xfs/076 looks like it's a sparse inode allocation test, which certainly > > > seems relevant in that it is stressing the ability to allocate inode > > > chunks under free space fragmentation. If this patch further restricts > > > extent allocation by removing availability of some set of (recently > > > freed, busy) extents, then perhaps there is some allocation failure > > > sequence that was previously unlikely enough to mask some poor error > > > handling logic or transaction handling (like an agfl fixup dirtying a > > > transaction followed by an allocation failure, for example) that we're > > > now running into. > > > > > > > Can you think of another reason to explain the regression this fix > > > > introduces to 5.10.y? > > > > > > > > > > Not off the top of my head. Something along the lines of the above seems > > > plausible, but that's just speculation at this point. > > > > > > > Do you care to investigate this failure or shall I just move on? > > > > > > > > > > I think it would be good to understand whether there's a regression > > > introduced by this patch, a bug somewhere else or just some impedence > > > mismatch in logic between the combination of this change and whatever > > > else happens to be in v5.10.y. Unfortunately I'm not able to reproduce > > > if I pull just this commit back into latest 5.10.y (5.10.118). I've > > > tried with a traditional bdev as well as a preallocated and sparse > > > loopback scratch dev. > > > > I also failed to reproduce it on another VM, but it reproduces reliably > > on this system. That's why I thought we'd better use this opportunity. > > This system has lots of RAM and disk to spare so I have no problem > > running this test in a VM in parallel to my work. > > > > It is not actually my system, it's a system that Luis has setup for > > stable XFS testing and gave me access to, so if the need arises > > you could get direct access to the system, but for now, I have no > > problem running the test for you. > > > > > Have you tested this patch (backport) in isolation > > > in your reproducer env or only in combination with other pending > > > backports? > > > > > > > I tested it on top of 5.10.109 + these 5 patches: > > https://github.com/amir73il/linux/commits/xfs-5.10.y-1 > > > > I can test it in isolation if you like. Let me know if there are > > other forensics that you would like me to collect. > > > > Hm. Still no luck if I move to .109 and pull in those few patches. I > assume there's nothing else potentially interesting about the test env > other than the sparse file scratch dev (i.e., default mkfs options, > etc.)? If so and you can reliably reproduce, I suppose it couldn't hurt > to try and grab a tracepoint dump of the test when it fails (feel free > to send directly or upload somewhere as the list may punt it, and please > also include the dmesg output that goes along with it) and I can see if > that shows anything helpful. > > I think what we want to know initially is what error code we're > producing (-ENOSPC?) and where it originates, and from there we can > probably work out how the transaction might be dirty. I'm not sure a > trace dump will express that conclusively. If you wanted to increase the > odds of getting some useful information it might be helpful to stick a > few trace_printk() calls in the various trans cancel error paths out of > xfs_create() to determine whether it's the inode allocation attempt that > fails or the subsequent attempt to create the directory entry.. > The error (-ENOSPC) comes from this v5.10 code in xfs_dir_ialloc(): if (!ialloc_context && !ip) { *ipp = NULL; return -ENOSPC; } Which theoretically might trip after xfs_ialloc() has marked the transaction dirty(?). This specific code is gone with this cleanup series in v5.11: https://lore.kernel.org/linux-xfs/20201209112820.114863-1-hsiangkao@redhat.com/ When the $SUBJECT patch is applied to v5.11.16 the test xfs/076 does not fail. So either the $SUBJECT patch (from 5.12) is incompatible with v5.10 code or the cleanup series somehow managed to make my system not reproduce the bug anymore. I will assume the former and drop this patch from v5.10.y candidates. If you want me to continue to research the bug on v5.10 let me know what else you want me to check. Thanks, Amir.
On Sat, May 28, 2022 at 05:23:19PM +0300, Amir Goldstein wrote: > On Thu, May 26, 2022 at 10:47 PM Brian Foster <bfoster@redhat.com> wrote: > > > > On Thu, May 26, 2022 at 06:28:23PM +0300, Amir Goldstein wrote: > > > > > Hi Brian, > > > > > > > > > > This patch was one of my selected fixes to backport for v5.10.y. > > > > > It has a very scary looking commit message and the change seems > > > > > to be independent of any infrastructure changes(?). > > > > > > > > > > The problem is that applying this patch to v5.10.y reliably reproduces > > > > > this buffer corruption assertion [*] with test xfs/076. > > > > > > > > > > This happens on the kdevops system that is using loop devices over > > > > > sparse files inside qemu images. It does not reproduce on my small > > > > > VM at home. > > > > > > > > > > Normally, I would just drop this patch from the stable candidates queue > > > > > and move on, but I thought you might be interested to investigate this > > > > > reliable reproducer, because maybe this system exercises an error > > > > > that is otherwise rare to hit. > > > > > > > > > > It seemed weird to me that NOT reusing the extent would result in > > > > > data corruption, but it could indicate that reusing the extent was masking > > > > > the assertion and hiding another bug(?). > > > > > > > > > > > > > Indeed, this does seem like an odd failure. The shutdown on transaction > > > > cancel implies cancellation of a dirty transaction. This is not > > > > necessarily corruption as much as just being the generic > > > > naming/messaging related to shutdowns due to unexpected in-core state. > > > > The patch in question removes some modifications to in-core busy extent > > > > state during extent allocation that are fundamentally unsafe in > > > > combination with how allocation works. This change doesn't appear to > > > > affect any transaction directly, so the correlation may be indirect. > > > > > > > > xfs/076 looks like it's a sparse inode allocation test, which certainly > > > > seems relevant in that it is stressing the ability to allocate inode > > > > chunks under free space fragmentation. If this patch further restricts > > > > extent allocation by removing availability of some set of (recently > > > > freed, busy) extents, then perhaps there is some allocation failure > > > > sequence that was previously unlikely enough to mask some poor error > > > > handling logic or transaction handling (like an agfl fixup dirtying a > > > > transaction followed by an allocation failure, for example) that we're > > > > now running into. > > > > > > > > > Can you think of another reason to explain the regression this fix > > > > > introduces to 5.10.y? > > > > > > > > > > > > > Not off the top of my head. Something along the lines of the above seems > > > > plausible, but that's just speculation at this point. > > > > > > > > > Do you care to investigate this failure or shall I just move on? > > > > > > > > > > > > > I think it would be good to understand whether there's a regression > > > > introduced by this patch, a bug somewhere else or just some impedence > > > > mismatch in logic between the combination of this change and whatever > > > > else happens to be in v5.10.y. Unfortunately I'm not able to reproduce > > > > if I pull just this commit back into latest 5.10.y (5.10.118). I've > > > > tried with a traditional bdev as well as a preallocated and sparse > > > > loopback scratch dev. > > > > > > I also failed to reproduce it on another VM, but it reproduces reliably > > > on this system. That's why I thought we'd better use this opportunity. > > > This system has lots of RAM and disk to spare so I have no problem > > > running this test in a VM in parallel to my work. > > > > > > It is not actually my system, it's a system that Luis has setup for > > > stable XFS testing and gave me access to, so if the need arises > > > you could get direct access to the system, but for now, I have no > > > problem running the test for you. > > > > > > > Have you tested this patch (backport) in isolation > > > > in your reproducer env or only in combination with other pending > > > > backports? > > > > > > > > > > I tested it on top of 5.10.109 + these 5 patches: > > > https://github.com/amir73il/linux/commits/xfs-5.10.y-1 > > > > > > I can test it in isolation if you like. Let me know if there are > > > other forensics that you would like me to collect. > > > > > > > Hm. Still no luck if I move to .109 and pull in those few patches. I > > assume there's nothing else potentially interesting about the test env > > other than the sparse file scratch dev (i.e., default mkfs options, > > etc.)? If so and you can reliably reproduce, I suppose it couldn't hurt > > to try and grab a tracepoint dump of the test when it fails (feel free > > to send directly or upload somewhere as the list may punt it, and please > > also include the dmesg output that goes along with it) and I can see if > > that shows anything helpful. > > > > I think what we want to know initially is what error code we're > > producing (-ENOSPC?) and where it originates, and from there we can > > probably work out how the transaction might be dirty. I'm not sure a > > trace dump will express that conclusively. If you wanted to increase the > > odds of getting some useful information it might be helpful to stick a > > few trace_printk() calls in the various trans cancel error paths out of > > xfs_create() to determine whether it's the inode allocation attempt that > > fails or the subsequent attempt to create the directory entry.. > > > > The error (-ENOSPC) comes from this v5.10 code in xfs_dir_ialloc(): > > if (!ialloc_context && !ip) { > *ipp = NULL; > return -ENOSPC; > } > > Which theoretically might trip after xfs_ialloc() has marked the transaction > dirty(?). > Yeah, I think the first part that might have dirtied the transaction at this point is fixing up the AGFL on a block allocation attempt. Ideally the AG selection code would prevent taking this step for an AG that can't satisfy an allocation, but realistically I'm not sure this approach will ever work perfectly unless it separates out the extent search algorithm from the prospect of dirtying the transaction and thus committing to the allocation. This may not be trivial because the AGFL fixup can require extent allocation itself. It's not terribly surprising that limiting reuse of busy extents could increase the likelihood of allocation failure, though I think most cases should flush busy extents and retry. I wonder a bit whether changes to the near mode allocation algorithm may have introduced a potential regression in that regard. > This specific code is gone with this cleanup series in v5.11: > > https://lore.kernel.org/linux-xfs/20201209112820.114863-1-hsiangkao@redhat.com/ > > When the $SUBJECT patch is applied to v5.11.16 the test xfs/076 does not fail. > > So either the $SUBJECT patch (from 5.12) is incompatible with v5.10 code > or the cleanup series somehow managed to make my system not reproduce > the bug anymore. > I thought this was mostly refactoring and reworking the ugly retry pattern in the former code vs. major functional changes. I think the fundamental prospect of allocation failure still exists in current code, but clearly your tests demonstrate some practical difference. Perhaps there are some subtle logic changes in that rework that help prevent this problem. > I will assume the former and drop this patch from v5.10.y candidates. > If you want me to continue to research the bug on v5.10 let me know > what else you want me to check. > I think that's reasonable. Realistically the bug fixed by this patch is so old and long standing (v3.0?) I don't think it's terribly important to pull it from v5.12 to v5.10.y unless real users started hitting it. My curiosity around the cause is moreso to identify whether there's a bug in mainline that needs fixing.. Brian > Thanks, > Amir. >
Hi, On 2022/5/26 19:34, Amir Goldstein wrote: > On Tue, Feb 23, 2021 at 2:35 PM Brian Foster <bfoster@redhat.com> wrote: >> >> On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote: >>> On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote: >>>> Freed extents are marked busy from the point the freeing transaction >>>> commits until the associated CIL context is checkpointed to the log. >>>> This prevents reuse and overwrite of recently freed blocks before >>>> the changes are committed to disk, which can lead to corruption >>>> after a crash. The exception to this rule is that metadata >>>> allocation is allowed to reuse busy extents because metadata changes >>>> are also logged. >>>> >>>> As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS >>>> has allowed modification or complete invalidation of outstanding >>>> busy extents for metadata allocations. This implementation assumes >>>> that use of the associated extent is imminent, which is not always >>>> the case. For example, the trimmed extent might not satisfy the >>>> minimum length of the allocation request, or the allocation >>>> algorithm might be involved in a search for the optimal result based >>>> on locality. >>>> >>>> generic/019 reproduces a corruption caused by this scenario. First, >>>> a metadata block (usually a bmbt or symlink block) is freed from an >>>> inode. A subsequent bmbt split on an unrelated inode attempts a near >>>> mode allocation request that invalidates the busy block during the >>>> search, but does not ultimately allocate it. Due to the busy state >>>> invalidation, the block is no longer considered busy to subsequent >>>> allocation. A direct I/O write request immediately allocates the >>>> block and writes to it. >>> ... >> > > Hi Brian, > > This patch was one of my selected fixes to backport for v5.10.y. > It has a very scary looking commit message and the change seems > to be independent of any infrastructure changes(?). > > The problem is that applying this patch to v5.10.y reliably reproduces > this buffer corruption assertion [*] with test xfs/076. > > This happens on the kdevops system that is using loop devices over > sparse files inside qemu images. It does not reproduce on my small > VM at home. > > Normally, I would just drop this patch from the stable candidates queue > and move on, but I thought you might be interested to investigate this > reliable reproducer, because maybe this system exercises an error > that is otherwise rare to hit. > > It seemed weird to me that NOT reusing the extent would result in > data corruption, but it could indicate that reusing the extent was masking > the assertion and hiding another bug(?). > > Can you think of another reason to explain the regression this fix > introduces to 5.10.y? > > Do you care to investigate this failure or shall I just move on? > > Thanks, > Amir. > > [*] > : XFS (loop5): Internal error xfs_trans_cancel at line 954 of file > fs/xfs/xfs_trans.c. Caller xfs_create+0x22f/0x590 [xfs] > : CPU: 3 PID: 25481 Comm: touch Kdump: loaded Tainted: G E > 5.10.109-xfs-2 #8 > : Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 > 04/01/2014 > : Call Trace: > : dump_stack+0x6d/0x88 > : xfs_trans_cancel+0x17b/0x1a0 [xfs] > : xfs_create+0x22f/0x590 [xfs] > : xfs_generic_create+0x245/0x310 [xfs] > : ? d_splice_alias+0x13a/0x3c0 > : path_openat+0xe3f/0x1080 > : do_filp_open+0x93/0x100 > : ? handle_mm_fault+0x148e/0x1690 > : ? __check_object_size+0x162/0x180 > : do_sys_openat2+0x228/0x2d0 > : do_sys_open+0x4b/0x80 > : do_syscall_64+0x33/0x80 > : entry_SYSCALL_64_after_hwframe+0x44/0xa9 > : RIP: 0033:0x7f36b02eff1e > : Code: 25 00 00 41 00 3d 00 00 41 00 74 48 48 8d 05 e9 57 0d 00 8b 00 > 85 c0 75 69 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d > 00 f0 ff ff 0 > : RSP: 002b:00007ffe7ef6ca10 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 > : RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f36b02eff1e > : RDX: 0000000000000941 RSI: 00007ffe7ef6ebfa RDI: 00000000ffffff9c > : RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 > : R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000002 > : R13: 00007ffe7ef6ebfa R14: 0000000000000001 R15: 0000000000000001 > : XFS (loop5): xfs_do_force_shutdown(0x8) called from line 955 of file > fs/xfs/xfs_trans.c. Return address = ffffffffc08f5764 > : XFS (loop5): Corruption of in-memory data detected. Shutting down filesystem > : XFS (loop5): Please unmount the filesystem and rectify the problem(s) > (...just for the record) We also encountered this issue but without commit xfs: don't reuse busy extents on extent trim applied in our 5.10 codebase... Need to find some time to look into that... [ 413.283300] XFS (loop1): Internal error xfs_trans_cancel at line 950 of file fs/xfs/xfs_trans.c. Caller xfs_create+0x219/0x590 [xfs] [ 413.284295] CPU: 0 PID: 27484 Comm: touch Tainted: G E 5.10.134-13.an8.x86_64 #1 [ 413.284296] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014 [ 413.284297] Call Trace: [ 413.284314] dump_stack+0x57/0x6e [ 413.284373] xfs_trans_cancel+0xa3/0x110 [xfs] [ 413.284412] xfs_create+0x219/0x590 [xfs] [ 413.284458] xfs_generic_create+0x21f/0x2d0 [xfs] [ 413.284462] path_openat+0xdee/0x1020 [ 413.284464] do_filp_open+0x80/0xd0 [ 413.284467] ? __check_object_size+0x16a/0x180 [ 413.284469] do_sys_openat2+0x207/0x2c0 [ 413.284471] do_sys_open+0x3b/0x60 [ 413.284475] do_syscall_64+0x33/0x40 [ 413.284478] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [ 413.284481] RIP: 0033:0x7fe623920252 [ 413.284482] Code: 25 00 00 41 00 3d 00 00 41 00 74 4c 48 8d 05 55 43 2a 00 8b 00 85 c0 75 6d 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 0f 87 a2 00 00 00 48 8b 4c 24 28 64 48 33 0c 25 [ 413.284483] RSP: 002b:00007ffd7a38ca70 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 [ 413.284485] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007fe623920252 [ 413.284486] RDX: 0000000000000941 RSI: 00007ffd7a38d79d RDI: 00000000ffffff9c [ 413.284487] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 [ 413.284488] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000001 [ 413.284488] R13: 0000000000000001 R14: 00007ffd7a38d79d R15: 00007fe623bbf374 [ 413.289856] XFS (loop1): xfs_do_force_shutdown(0x8) called from line 951 of file fs/xfs/xfs_trans.c. Return address = 000000003fe0b8ba [ 413.289858] XFS (loop1): Corruption of in-memory data detected. Shutting down filesystem [ 413.290573] XFS (loop1): Please unmount the filesystem and rectify the problem(s) Thanks, Gao Xiang
On Wed, Jan 11, 2023 at 10:41:43PM +0800, Gao Xiang wrote: > Hi, > > On 2022/5/26 19:34, Amir Goldstein wrote: > > On Tue, Feb 23, 2021 at 2:35 PM Brian Foster <bfoster@redhat.com> wrote: > > > > > > On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote: > > > > On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote: > > > > > Freed extents are marked busy from the point the freeing transaction > > > > > commits until the associated CIL context is checkpointed to the log. > > > > > This prevents reuse and overwrite of recently freed blocks before > > > > > the changes are committed to disk, which can lead to corruption > > > > > after a crash. The exception to this rule is that metadata > > > > > allocation is allowed to reuse busy extents because metadata changes > > > > > are also logged. > > > > > > > > > > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS > > > > > has allowed modification or complete invalidation of outstanding > > > > > busy extents for metadata allocations. This implementation assumes > > > > > that use of the associated extent is imminent, which is not always > > > > > the case. For example, the trimmed extent might not satisfy the > > > > > minimum length of the allocation request, or the allocation > > > > > algorithm might be involved in a search for the optimal result based > > > > > on locality. > > > > > > > > > > generic/019 reproduces a corruption caused by this scenario. First, > > > > > a metadata block (usually a bmbt or symlink block) is freed from an > > > > > inode. A subsequent bmbt split on an unrelated inode attempts a near > > > > > mode allocation request that invalidates the busy block during the > > > > > search, but does not ultimately allocate it. Due to the busy state > > > > > invalidation, the block is no longer considered busy to subsequent > > > > > allocation. A direct I/O write request immediately allocates the > > > > > block and writes to it. > > > > > > ... > > > > > > > > Hi Brian, > > > > This patch was one of my selected fixes to backport for v5.10.y. > > It has a very scary looking commit message and the change seems > > to be independent of any infrastructure changes(?). > > > > The problem is that applying this patch to v5.10.y reliably reproduces > > this buffer corruption assertion [*] with test xfs/076. > > > > This happens on the kdevops system that is using loop devices over > > sparse files inside qemu images. It does not reproduce on my small > > VM at home. > > > > Normally, I would just drop this patch from the stable candidates queue > > and move on, but I thought you might be interested to investigate this > > reliable reproducer, because maybe this system exercises an error > > that is otherwise rare to hit. > > > > It seemed weird to me that NOT reusing the extent would result in > > data corruption, but it could indicate that reusing the extent was masking > > the assertion and hiding another bug(?). > > > > Can you think of another reason to explain the regression this fix > > introduces to 5.10.y? > > > > Do you care to investigate this failure or shall I just move on? > > > > Thanks, > > Amir. > > > > [*] > > : XFS (loop5): Internal error xfs_trans_cancel at line 954 of file > > fs/xfs/xfs_trans.c. Caller xfs_create+0x22f/0x590 [xfs] > > : CPU: 3 PID: 25481 Comm: touch Kdump: loaded Tainted: G E > > 5.10.109-xfs-2 #8 > > : Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 > > 04/01/2014 > > : Call Trace: > > : dump_stack+0x6d/0x88 > > : xfs_trans_cancel+0x17b/0x1a0 [xfs] > > : xfs_create+0x22f/0x590 [xfs] > > : xfs_generic_create+0x245/0x310 [xfs] > > : ? d_splice_alias+0x13a/0x3c0 > > : path_openat+0xe3f/0x1080 > > : do_filp_open+0x93/0x100 > > : ? handle_mm_fault+0x148e/0x1690 > > : ? __check_object_size+0x162/0x180 > > : do_sys_openat2+0x228/0x2d0 > > : do_sys_open+0x4b/0x80 > > : do_syscall_64+0x33/0x80 > > : entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > : RIP: 0033:0x7f36b02eff1e > > : Code: 25 00 00 41 00 3d 00 00 41 00 74 48 48 8d 05 e9 57 0d 00 8b 00 > > 85 c0 75 69 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d > > 00 f0 ff ff 0 > > : RSP: 002b:00007ffe7ef6ca10 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 > > : RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f36b02eff1e > > : RDX: 0000000000000941 RSI: 00007ffe7ef6ebfa RDI: 00000000ffffff9c > > : RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 > > : R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000002 > > : R13: 00007ffe7ef6ebfa R14: 0000000000000001 R15: 0000000000000001 > > : XFS (loop5): xfs_do_force_shutdown(0x8) called from line 955 of file > > fs/xfs/xfs_trans.c. Return address = ffffffffc08f5764 > > : XFS (loop5): Corruption of in-memory data detected. Shutting down filesystem > > : XFS (loop5): Please unmount the filesystem and rectify the problem(s) > > > > (...just for the record) We also encountered this issue but without commit > xfs: don't reuse busy extents on extent trim > applied in our 5.10 codebase... > > Need to find some time to look into that... > > [ 413.283300] XFS (loop1): Internal error xfs_trans_cancel at line 950 of file fs/xfs/xfs_trans.c. Caller xfs_create+0x219/0x590 [xfs] > [ 413.284295] CPU: 0 PID: 27484 Comm: touch Tainted: G E 5.10.134-13.an8.x86_64 #1 > [ 413.284296] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014 > [ 413.284297] Call Trace: > [ 413.284314] dump_stack+0x57/0x6e > [ 413.284373] xfs_trans_cancel+0xa3/0x110 [xfs] > [ 413.284412] xfs_create+0x219/0x590 [xfs] > [ 413.284458] xfs_generic_create+0x21f/0x2d0 [xfs] > [ 413.284462] path_openat+0xdee/0x1020 > [ 413.284464] do_filp_open+0x80/0xd0 > [ 413.284467] ? __check_object_size+0x16a/0x180 > [ 413.284469] do_sys_openat2+0x207/0x2c0 > [ 413.284471] do_sys_open+0x3b/0x60 > [ 413.284475] do_syscall_64+0x33/0x40 > [ 413.284478] entry_SYSCALL_64_after_hwframe+0x61/0xc6 > [ 413.284481] RIP: 0033:0x7fe623920252 > [ 413.284482] Code: 25 00 00 41 00 3d 00 00 41 00 74 4c 48 8d 05 55 43 2a 00 8b 00 85 c0 75 6d 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 0f 87 a2 00 00 00 48 8b 4c 24 28 64 48 33 0c 25 > [ 413.284483] RSP: 002b:00007ffd7a38ca70 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 > [ 413.284485] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007fe623920252 > [ 413.284486] RDX: 0000000000000941 RSI: 00007ffd7a38d79d RDI: 00000000ffffff9c > [ 413.284487] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 > [ 413.284488] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000001 > [ 413.284488] R13: 0000000000000001 R14: 00007ffd7a38d79d R15: 00007fe623bbf374 > [ 413.289856] XFS (loop1): xfs_do_force_shutdown(0x8) called from line 951 of file fs/xfs/xfs_trans.c. Return address = 000000003fe0b8ba > [ 413.289858] XFS (loop1): Corruption of in-memory data detected. Shutting down filesystem > [ 413.290573] XFS (loop1): Please unmount the filesystem and rectify the problem(s) This likely has nothing to do with the commit in this thread. The shutdowns reported in this thread are *not metadata corruption*; this typically occurs when ENOSPC has occurred during inode creation after an AGF has been dirtied. i.e. xfs_alloc_fix_freelist() has dirtied the selected AGF and AGFL doing AGFL fixup, then not had enough space left to allocate the minlen extent being requested. The allocation then cannot find space in any other AG (because ENOSPC), and the inode allocation fails and the transaction gets cancelled. Cancelling a dirty transaction is a fatal error - we cannot recover from it as the metadata state prior to the modification in the transaction cannot be easily recovered. i.e. we cannot roll back the changes that have been made, and hence we have to shut down the filesystem. The reason for the shutdown is that the in-memory state is not recoverable, hence it is considered corrupt. No actual corruption has occurred, we're just in an unrecoverable situation with partially modified metadata. IOWs, this is a symptom of a buggy "can we allocate successfully in this AG" check prior to attemptimg allocating and modifying metadata, not that there is actual filesystem corruption occurring. Cheers, Dave.
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c index 3991e59cfd18..ef17c1f6db32 100644 --- a/fs/xfs/xfs_extent_busy.c +++ b/fs/xfs/xfs_extent_busy.c @@ -344,7 +344,6 @@ xfs_extent_busy_trim( ASSERT(*len > 0); spin_lock(&args->pag->pagb_lock); -restart: fbno = *bno; flen = *len; rbp = args->pag->pagb_tree.rb_node; @@ -363,19 +362,6 @@ xfs_extent_busy_trim( continue; } - /* - * If this is a metadata allocation, try to reuse the busy - * extent instead of trimming the allocation. - */ - if (!(args->datatype & XFS_ALLOC_USERDATA) && - !(busyp->flags & XFS_EXTENT_BUSY_DISCARDED)) { - if (!xfs_extent_busy_update_extent(args->mp, args->pag, - busyp, fbno, flen, - false)) - goto restart; - continue; - } - if (bbno <= fbno) { /* start overlap */
Freed extents are marked busy from the point the freeing transaction commits until the associated CIL context is checkpointed to the log. This prevents reuse and overwrite of recently freed blocks before the changes are committed to disk, which can lead to corruption after a crash. The exception to this rule is that metadata allocation is allowed to reuse busy extents because metadata changes are also logged. As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS has allowed modification or complete invalidation of outstanding busy extents for metadata allocations. This implementation assumes that use of the associated extent is imminent, which is not always the case. For example, the trimmed extent might not satisfy the minimum length of the allocation request, or the allocation algorithm might be involved in a search for the optimal result based on locality. generic/019 reproduces a corruption caused by this scenario. First, a metadata block (usually a bmbt or symlink block) is freed from an inode. A subsequent bmbt split on an unrelated inode attempts a near mode allocation request that invalidates the busy block during the search, but does not ultimately allocate it. Due to the busy state invalidation, the block is no longer considered busy to subsequent allocation. A direct I/O write request immediately allocates the block and writes to it. Finally, the filesystem crashes while in a state where the initial metadata block free had not committed to the on-disk log. After recovery, the original metadata block is in its original location as expected, but has been corrupted by the aforementioned dio. This demonstrates that it is fundamentally unsafe to modify busy extent state for extents that are not guaranteed to be allocated. This applies to pretty much all of the code paths that currently trim busy extents for one reason or another. Therefore to address this problem, drop the reuse mechanism from the busy extent trim path. This code already knows how to return partial non-busy ranges of the targeted free extent and higher level code tracks the busy state of the allocation attempt. If a block allocation fails where one or more candidate extents is busy, we force the log and retry the allocation. Signed-off-by: Brian Foster <bfoster@redhat.com> --- fs/xfs/xfs_extent_busy.c | 14 -------------- 1 file changed, 14 deletions(-)