xfs: don't reuse busy extents on extent trim

Message ID	20210222153442.897089-1-bfoster@redhat.com (mailing list archive)
State	Accepted
Headers	show Return-Path: <linux-xfs-owner@kernel.org> From: Brian Foster <bfoster@redhat.com> To: linux-xfs@vger.kernel.org Subject: [PATCH] xfs: don't reuse busy extents on extent trim Date: Mon, 22 Feb 2021 10:34:42 -0500 Message-Id: <20210222153442.897089-1-bfoster@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	xfs: don't reuse busy extents on extent trim \| expand xfs: don't reuse busy extents on extent trim

Brian Foster Feb. 22, 2021, 3:34 p.m. UTC

Freed extents are marked busy from the point the freeing transaction
commits until the associated CIL context is checkpointed to the log.
This prevents reuse and overwrite of recently freed blocks before
the changes are committed to disk, which can lead to corruption
after a crash. The exception to this rule is that metadata
allocation is allowed to reuse busy extents because metadata changes
are also logged.

As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS
has allowed modification or complete invalidation of outstanding
busy extents for metadata allocations. This implementation assumes
that use of the associated extent is imminent, which is not always
the case. For example, the trimmed extent might not satisfy the
minimum length of the allocation request, or the allocation
algorithm might be involved in a search for the optimal result based
on locality.

generic/019 reproduces a corruption caused by this scenario. First,
a metadata block (usually a bmbt or symlink block) is freed from an
inode. A subsequent bmbt split on an unrelated inode attempts a near
mode allocation request that invalidates the busy block during the
search, but does not ultimately allocate it. Due to the busy state
invalidation, the block is no longer considered busy to subsequent
allocation. A direct I/O write request immediately allocates the
block and writes to it. Finally, the filesystem crashes while in a
state where the initial metadata block free had not committed to the
on-disk log. After recovery, the original metadata block is in its
original location as expected, but has been corrupted by the
aforementioned dio.

This demonstrates that it is fundamentally unsafe to modify busy
extent state for extents that are not guaranteed to be allocated.
This applies to pretty much all of the code paths that currently
trim busy extents for one reason or another. Therefore to address
this problem, drop the reuse mechanism from the busy extent trim
path. This code already knows how to return partial non-busy ranges
of the targeted free extent and higher level code tracks the busy
state of the allocation attempt. If a block allocation fails where
one or more candidate extents is busy, we force the log and retry
the allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_extent_busy.c | 14 --------------
 1 file changed, 14 deletions(-)

Darrick J. Wong Feb. 22, 2021, 6:27 p.m. UTC | #1

On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote:
> Freed extents are marked busy from the point the freeing transaction
> commits until the associated CIL context is checkpointed to the log.
> This prevents reuse and overwrite of recently freed blocks before
> the changes are committed to disk, which can lead to corruption
> after a crash. The exception to this rule is that metadata
> allocation is allowed to reuse busy extents because metadata changes
> are also logged.
> 
> As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS
> has allowed modification or complete invalidation of outstanding
> busy extents for metadata allocations. This implementation assumes
> that use of the associated extent is imminent, which is not always
> the case. For example, the trimmed extent might not satisfy the
> minimum length of the allocation request, or the allocation
> algorithm might be involved in a search for the optimal result based
> on locality.
> 
> generic/019 reproduces a corruption caused by this scenario. First,
> a metadata block (usually a bmbt or symlink block) is freed from an
> inode. A subsequent bmbt split on an unrelated inode attempts a near
> mode allocation request that invalidates the busy block during the
> search, but does not ultimately allocate it. Due to the busy state
> invalidation, the block is no longer considered busy to subsequent
> allocation. A direct I/O write request immediately allocates the
> block and writes to it.

I really hope there's a fstest case coming for this... :)

> Finally, the filesystem crashes while in a
> state where the initial metadata block free had not committed to the
> on-disk log. After recovery, the original metadata block is in its
> original location as expected, but has been corrupted by the
> aforementioned dio.

Wheee!

Looking at xfs_alloc_ag_vextent_exact, I guess the allocator will go
find a freespace record, call xfs_extent_busy_trim (which could erase
the busy extent entry), decide that it's not interested after all, and
bail out without restoring the busy entry.

Similarly, xfs_alloc_cur_check calls _busy_trim (same side effects) as
we wander around the free space btrees looking for a good chunk of
space... and doesn't restore the busy record if it decides to consider a
different extent.

So I guess this "speculatively remove busy records and forget to restore
them" behavior opens the door to the write allocating blocks that aren't
yet free and nonbusy, right?  And the solution presented here is to
avoid letting go of the busy record for the bmbt allocation, and if the
btree split caller decides it really /must/ have that block for the bmbt
it can force the log and try again, just like we do for a file data
allocation?

Another solution could have been to restore the record if we decide not
to go ahead with the allocation, but as we haven't yet committed to
using the space, there's no sense in thrashing the busy records?

Assuming I got all that right,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> This demonstrates that it is fundamentally unsafe to modify busy
> extent state for extents that are not guaranteed to be allocated.
> This applies to pretty much all of the code paths that currently
> trim busy extents for one reason or another. Therefore to address
> this problem, drop the reuse mechanism from the busy extent trim
> path. This code already knows how to return partial non-busy ranges
> of the targeted free extent and higher level code tracks the busy
> state of the allocation attempt. If a block allocation fails where
> one or more candidate extents is busy, we force the log and retry
> the allocation.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_extent_busy.c | 14 --------------
>  1 file changed, 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> index 3991e59cfd18..ef17c1f6db32 100644
> --- a/fs/xfs/xfs_extent_busy.c
> +++ b/fs/xfs/xfs_extent_busy.c
> @@ -344,7 +344,6 @@ xfs_extent_busy_trim(
>  	ASSERT(*len > 0);
>  
>  	spin_lock(&args->pag->pagb_lock);
> -restart:
>  	fbno = *bno;
>  	flen = *len;
>  	rbp = args->pag->pagb_tree.rb_node;
> @@ -363,19 +362,6 @@ xfs_extent_busy_trim(
>  			continue;
>  		}
>  
> -		/*
> -		 * If this is a metadata allocation, try to reuse the busy
> -		 * extent instead of trimming the allocation.
> -		 */
> -		if (!(args->datatype & XFS_ALLOC_USERDATA) &&
> -		    !(busyp->flags & XFS_EXTENT_BUSY_DISCARDED)) {
> -			if (!xfs_extent_busy_update_extent(args->mp, args->pag,
> -							  busyp, fbno, flen,
> -							  false))
> -				goto restart;
> -			continue;
> -		}
> -
>  		if (bbno <= fbno) {
>  			/* start overlap */
>  
> -- 
> 2.26.2
>

Chandan Babu R Feb. 23, 2021, 6:24 a.m. UTC | #2

On 22 Feb 2021 at 21:04, Brian Foster wrote:
> Freed extents are marked busy from the point the freeing transaction
> commits until the associated CIL context is checkpointed to the log.
> This prevents reuse and overwrite of recently freed blocks before
> the changes are committed to disk, which can lead to corruption
> after a crash. The exception to this rule is that metadata
> allocation is allowed to reuse busy extents because metadata changes
> are also logged.
>
> As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS
> has allowed modification or complete invalidation of outstanding
> busy extents for metadata allocations. This implementation assumes
> that use of the associated extent is imminent, which is not always
> the case. For example, the trimmed extent might not satisfy the
> minimum length of the allocation request, or the allocation
> algorithm might be involved in a search for the optimal result based
> on locality.
>
> generic/019 reproduces a corruption caused by this scenario. First,
> a metadata block (usually a bmbt or symlink block) is freed from an
> inode. A subsequent bmbt split on an unrelated inode attempts a near
> mode allocation request that invalidates the busy block during the
> search, but does not ultimately allocate it. Due to the busy state
> invalidation, the block is no longer considered busy to subsequent
> allocation. A direct I/O write request immediately allocates the
> block and writes to it. Finally, the filesystem crashes while in a
> state where the initial metadata block free had not committed to the
> on-disk log. After recovery, the original metadata block is in its
> original location as expected, but has been corrupted by the
> aforementioned dio.
>
> This demonstrates that it is fundamentally unsafe to modify busy
> extent state for extents that are not guaranteed to be allocated.
> This applies to pretty much all of the code paths that currently
> trim busy extents for one reason or another. Therefore to address
> this problem, drop the reuse mechanism from the busy extent trim
> path. This code already knows how to return partial non-busy ranges
> of the targeted free extent and higher level code tracks the busy
> state of the allocation attempt. If a block allocation fails where
> one or more candidate extents is busy, we force the log and retry
> the allocation.
>

The changes look good to me.

Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_extent_busy.c | 14 --------------
>  1 file changed, 14 deletions(-)
>
> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> index 3991e59cfd18..ef17c1f6db32 100644
> --- a/fs/xfs/xfs_extent_busy.c
> +++ b/fs/xfs/xfs_extent_busy.c
> @@ -344,7 +344,6 @@ xfs_extent_busy_trim(
>  	ASSERT(*len > 0);
>  
>  	spin_lock(&args->pag->pagb_lock);
> -restart:
>  	fbno = *bno;
>  	flen = *len;
>  	rbp = args->pag->pagb_tree.rb_node;
> @@ -363,19 +362,6 @@ xfs_extent_busy_trim(
>  			continue;
>  		}
>  
> -		/*
> -		 * If this is a metadata allocation, try to reuse the busy
> -		 * extent instead of trimming the allocation.
> -		 */
> -		if (!(args->datatype & XFS_ALLOC_USERDATA) &&
> -		    !(busyp->flags & XFS_EXTENT_BUSY_DISCARDED)) {
> -			if (!xfs_extent_busy_update_extent(args->mp, args->pag,
> -							  busyp, fbno, flen,
> -							  false))
> -				goto restart;
> -			continue;
> -		}
> -
>  		if (bbno <= fbno) {
>  			/* start overlap */

Brian Foster Feb. 23, 2021, 12:31 p.m. UTC | #3

On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote:
> On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote:
> > Freed extents are marked busy from the point the freeing transaction
> > commits until the associated CIL context is checkpointed to the log.
> > This prevents reuse and overwrite of recently freed blocks before
> > the changes are committed to disk, which can lead to corruption
> > after a crash. The exception to this rule is that metadata
> > allocation is allowed to reuse busy extents because metadata changes
> > are also logged.
> > 
> > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS
> > has allowed modification or complete invalidation of outstanding
> > busy extents for metadata allocations. This implementation assumes
> > that use of the associated extent is imminent, which is not always
> > the case. For example, the trimmed extent might not satisfy the
> > minimum length of the allocation request, or the allocation
> > algorithm might be involved in a search for the optimal result based
> > on locality.
> > 
> > generic/019 reproduces a corruption caused by this scenario. First,
> > a metadata block (usually a bmbt or symlink block) is freed from an
> > inode. A subsequent bmbt split on an unrelated inode attempts a near
> > mode allocation request that invalidates the busy block during the
> > search, but does not ultimately allocate it. Due to the busy state
> > invalidation, the block is no longer considered busy to subsequent
> > allocation. A direct I/O write request immediately allocates the
> > block and writes to it.
> 
> I really hope there's a fstest case coming for this... :)
> 

generic/019? :) I'm not sure of a good way to reproduce on demand given
the conditions required to reproduce.

> > Finally, the filesystem crashes while in a
> > state where the initial metadata block free had not committed to the
> > on-disk log. After recovery, the original metadata block is in its
> > original location as expected, but has been corrupted by the
> > aforementioned dio.
> 
> Wheee!
> 
> Looking at xfs_alloc_ag_vextent_exact, I guess the allocator will go
> find a freespace record, call xfs_extent_busy_trim (which could erase
> the busy extent entry), decide that it's not interested after all, and
> bail out without restoring the busy entry.
> 
> Similarly, xfs_alloc_cur_check calls _busy_trim (same side effects) as
> we wander around the free space btrees looking for a good chunk of
> space... and doesn't restore the busy record if it decides to consider a
> different extent.
> 

Yep. I was originally curious whether the more recent allocator rework
introduced this problem somehow, but AFAICT that just refactored the
relevant allocator code and this bug has been latent in the existing
code for quite some time. That's not hugely surprising given the rare
combination of conditions required to reproduce.

> So I guess this "speculatively remove busy records and forget to restore
> them" behavior opens the door to the write allocating blocks that aren't
> yet free and nonbusy, right?  And the solution presented here is to
> avoid letting go of the busy record for the bmbt allocation, and if the
> btree split caller decides it really /must/ have that block for the bmbt
> it can force the log and try again, just like we do for a file data
> allocation?
> 

Yes, pretty much. The metadata allocation that is allowed to safely
reuse busy extents ends up invalidating a set of blocks during a NEAR
mode search (i.e. bmbt allocation), but ends up only using one of those
blocks. A data allocation immediately comes along next, finds one of the
other invalidated blocks and writes to it. A crash/recovery leaves the
invalidated busy block in its original metadata location having already
been written to by the dio.

> Another solution could have been to restore the record if we decide not
> to go ahead with the allocation, but as we haven't yet committed to
> using the space, there's no sense in thrashing the busy records?
> 

That was my original thought as well. Then after looking through the
code a bit I thought that something like allowing the allocator to
"track" a reusable, but still busy extent until allocation is imminent
might be a bit more straightforward of an implementation given the
layering between the allocator and busy extent tracking code. IOW, we'd
split the busy trim/available and busy invalidate logic into two steps
instead of doing it immediately in the busy trim path. That would allow
the allocator to consider the same set of reusable busy blocks but not
commit to any of them until the allocation search is complete.

However, either of those options require a bit of thought and rework
(and perhaps some value proposition justification for the complexity)
while the current trim reuse code is pretty much bolted on and broken.
Therefore, I think it's appropriate to fix the bug in one step and
follow up with a different implementation separately.

> Assuming I got all that right,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 

Thanks.

Brian

> --D
> 
> 
> > This demonstrates that it is fundamentally unsafe to modify busy
> > extent state for extents that are not guaranteed to be allocated.
> > This applies to pretty much all of the code paths that currently
> > trim busy extents for one reason or another. Therefore to address
> > this problem, drop the reuse mechanism from the busy extent trim
> > path. This code already knows how to return partial non-busy ranges
> > of the targeted free extent and higher level code tracks the busy
> > state of the allocation attempt. If a block allocation fails where
> > one or more candidate extents is busy, we force the log and retry
> > the allocation.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_extent_busy.c | 14 --------------
> >  1 file changed, 14 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > index 3991e59cfd18..ef17c1f6db32 100644
> > --- a/fs/xfs/xfs_extent_busy.c
> > +++ b/fs/xfs/xfs_extent_busy.c
> > @@ -344,7 +344,6 @@ xfs_extent_busy_trim(
> >  	ASSERT(*len > 0);
> >  
> >  	spin_lock(&args->pag->pagb_lock);
> > -restart:
> >  	fbno = *bno;
> >  	flen = *len;
> >  	rbp = args->pag->pagb_tree.rb_node;
> > @@ -363,19 +362,6 @@ xfs_extent_busy_trim(
> >  			continue;
> >  		}
> >  
> > -		/*
> > -		 * If this is a metadata allocation, try to reuse the busy
> > -		 * extent instead of trimming the allocation.
> > -		 */
> > -		if (!(args->datatype & XFS_ALLOC_USERDATA) &&
> > -		    !(busyp->flags & XFS_EXTENT_BUSY_DISCARDED)) {
> > -			if (!xfs_extent_busy_update_extent(args->mp, args->pag,
> > -							  busyp, fbno, flen,
> > -							  false))
> > -				goto restart;
> > -			continue;
> > -		}
> > -
> >  		if (bbno <= fbno) {
> >  			/* start overlap */
> >  
> > -- 
> > 2.26.2
> > 
>

Darrick J. Wong Feb. 23, 2021, 6:22 p.m. UTC | #4

On Tue, Feb 23, 2021 at 07:31:06AM -0500, Brian Foster wrote:
> On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote:
> > > Freed extents are marked busy from the point the freeing transaction
> > > commits until the associated CIL context is checkpointed to the log.
> > > This prevents reuse and overwrite of recently freed blocks before
> > > the changes are committed to disk, which can lead to corruption
> > > after a crash. The exception to this rule is that metadata
> > > allocation is allowed to reuse busy extents because metadata changes
> > > are also logged.
> > > 
> > > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS
> > > has allowed modification or complete invalidation of outstanding
> > > busy extents for metadata allocations. This implementation assumes
> > > that use of the associated extent is imminent, which is not always
> > > the case. For example, the trimmed extent might not satisfy the
> > > minimum length of the allocation request, or the allocation
> > > algorithm might be involved in a search for the optimal result based
> > > on locality.
> > > 
> > > generic/019 reproduces a corruption caused by this scenario. First,
> > > a metadata block (usually a bmbt or symlink block) is freed from an
> > > inode. A subsequent bmbt split on an unrelated inode attempts a near
> > > mode allocation request that invalidates the busy block during the
> > > search, but does not ultimately allocate it. Due to the busy state
> > > invalidation, the block is no longer considered busy to subsequent
> > > allocation. A direct I/O write request immediately allocates the
> > > block and writes to it.
> > 
> > I really hope there's a fstest case coming for this... :)
> > 
> 
> generic/019? :) I'm not sure of a good way to reproduce on demand given
> the conditions required to reproduce.

<nod> I guess you'd have to have a fs where extents take a long time to
exit the busy tree, and then set up the allocations and frees just
right.  FWIW I've never hit this in generic/019.

> > > Finally, the filesystem crashes while in a
> > > state where the initial metadata block free had not committed to the
> > > on-disk log. After recovery, the original metadata block is in its
> > > original location as expected, but has been corrupted by the
> > > aforementioned dio.
> > 
> > Wheee!
> > 
> > Looking at xfs_alloc_ag_vextent_exact, I guess the allocator will go
> > find a freespace record, call xfs_extent_busy_trim (which could erase
> > the busy extent entry), decide that it's not interested after all, and
> > bail out without restoring the busy entry.
> > 
> > Similarly, xfs_alloc_cur_check calls _busy_trim (same side effects) as
> > we wander around the free space btrees looking for a good chunk of
> > space... and doesn't restore the busy record if it decides to consider a
> > different extent.
> > 
> 
> Yep. I was originally curious whether the more recent allocator rework
> introduced this problem somehow, but AFAICT that just refactored the
> relevant allocator code and this bug has been latent in the existing
> code for quite some time. That's not hugely surprising given the rare
> combination of conditions required to reproduce.
> 
> > So I guess this "speculatively remove busy records and forget to restore
> > them" behavior opens the door to the write allocating blocks that aren't
> > yet free and nonbusy, right?  And the solution presented here is to
> > avoid letting go of the busy record for the bmbt allocation, and if the
> > btree split caller decides it really /must/ have that block for the bmbt
> > it can force the log and try again, just like we do for a file data
> > allocation?
> > 
> 
> Yes, pretty much. The metadata allocation that is allowed to safely
> reuse busy extents ends up invalidating a set of blocks during a NEAR
> mode search (i.e. bmbt allocation), but ends up only using one of those
> blocks. A data allocation immediately comes along next, finds one of the
> other invalidated blocks and writes to it. A crash/recovery leaves the
> invalidated busy block in its original metadata location having already
> been written to by the dio.
> 
> > Another solution could have been to restore the record if we decide not
> > to go ahead with the allocation, but as we haven't yet committed to
> > using the space, there's no sense in thrashing the busy records?
> > 
> 
> That was my original thought as well. Then after looking through the
> code a bit I thought that something like allowing the allocator to
> "track" a reusable, but still busy extent until allocation is imminent
> might be a bit more straightforward of an implementation given the
> layering between the allocator and busy extent tracking code. IOW, we'd
> split the busy trim/available and busy invalidate logic into two steps
> instead of doing it immediately in the busy trim path. That would allow
> the allocator to consider the same set of reusable busy blocks but not
> commit to any of them until the allocation search is complete.
> 
> However, either of those options require a bit of thought and rework
> (and perhaps some value proposition justification for the complexity)
> while the current trim reuse code is pretty much bolted on and broken.
> Therefore, I think it's appropriate to fix the bug in one step and
> follow up with a different implementation separately.

<nod> I'm not sure that's even worth the effort... :)

--D

> > Assuming I got all that right,
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > 
> 
> Thanks.
> 
> Brian
> 
> > --D
> > 
> > 
> > > This demonstrates that it is fundamentally unsafe to modify busy
> > > extent state for extents that are not guaranteed to be allocated.
> > > This applies to pretty much all of the code paths that currently
> > > trim busy extents for one reason or another. Therefore to address
> > > this problem, drop the reuse mechanism from the busy extent trim
> > > path. This code already knows how to return partial non-busy ranges
> > > of the targeted free extent and higher level code tracks the busy
> > > state of the allocation attempt. If a block allocation fails where
> > > one or more candidate extents is busy, we force the log and retry
> > > the allocation.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_extent_busy.c | 14 --------------
> > >  1 file changed, 14 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > index 3991e59cfd18..ef17c1f6db32 100644
> > > --- a/fs/xfs/xfs_extent_busy.c
> > > +++ b/fs/xfs/xfs_extent_busy.c
> > > @@ -344,7 +344,6 @@ xfs_extent_busy_trim(
> > >  	ASSERT(*len > 0);
> > >  
> > >  	spin_lock(&args->pag->pagb_lock);
> > > -restart:
> > >  	fbno = *bno;
> > >  	flen = *len;
> > >  	rbp = args->pag->pagb_tree.rb_node;
> > > @@ -363,19 +362,6 @@ xfs_extent_busy_trim(
> > >  			continue;
> > >  		}
> > >  
> > > -		/*
> > > -		 * If this is a metadata allocation, try to reuse the busy
> > > -		 * extent instead of trimming the allocation.
> > > -		 */
> > > -		if (!(args->datatype & XFS_ALLOC_USERDATA) &&
> > > -		    !(busyp->flags & XFS_EXTENT_BUSY_DISCARDED)) {
> > > -			if (!xfs_extent_busy_update_extent(args->mp, args->pag,
> > > -							  busyp, fbno, flen,
> > > -							  false))
> > > -				goto restart;
> > > -			continue;
> > > -		}
> > > -
> > >  		if (bbno <= fbno) {
> > >  			/* start overlap */
> > >  
> > > -- 
> > > 2.26.2
> > > 
> > 
>

Christoph Hellwig Feb. 25, 2021, 7:51 a.m. UTC | #5

As a quick fix this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

That beeing said we really need to go back and look into this,
especially due to discards.  For SSDs it is generlly much better to
quickly reuse freed blocks rather than discarding them later.

Brian Foster Feb. 25, 2021, 6:05 p.m. UTC | #6

On Thu, Feb 25, 2021 at 07:51:05AM +0000, Christoph Hellwig wrote:
> As a quick fix this looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> 
> That beeing said we really need to go back and look into this,
> especially due to discards.  For SSDs it is generlly much better to
> quickly reuse freed blocks rather than discarding them later.
> 

Ok, that's an interesting point. I'm not sure online discard is a super
critical use case, but I agree that there's a tangible advantage to
optimizing out pending discards in that configuration.

That also raises a caveat with the alternative implementation I was
mulling over. The current implementation simply skips over extents that
are busy and already under discard. If we were to find a prospective
reusable busy block (not under discard), allocate it, and then commit to
reusing it, we'd have to deal with the fact that we could find it under
discard at that point. We can't easily skip it because we've already
performed an allocation in the transaction by that point. I suspect the
simplest solution is just wait for the discard to complete since we
already have somewhat of a mechanism to do that, but hopefully it
wouldn't be a frequent occurence.

Brian

Amir Goldstein May 26, 2022, 11:34 a.m. UTC | #7

On Tue, Feb 23, 2021 at 2:35 PM Brian Foster <bfoster@redhat.com> wrote:
>
> On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote:
> > > Freed extents are marked busy from the point the freeing transaction
> > > commits until the associated CIL context is checkpointed to the log.
> > > This prevents reuse and overwrite of recently freed blocks before
> > > the changes are committed to disk, which can lead to corruption
> > > after a crash. The exception to this rule is that metadata
> > > allocation is allowed to reuse busy extents because metadata changes
> > > are also logged.
> > >
> > > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS
> > > has allowed modification or complete invalidation of outstanding
> > > busy extents for metadata allocations. This implementation assumes
> > > that use of the associated extent is imminent, which is not always
> > > the case. For example, the trimmed extent might not satisfy the
> > > minimum length of the allocation request, or the allocation
> > > algorithm might be involved in a search for the optimal result based
> > > on locality.
> > >
> > > generic/019 reproduces a corruption caused by this scenario. First,
> > > a metadata block (usually a bmbt or symlink block) is freed from an
> > > inode. A subsequent bmbt split on an unrelated inode attempts a near
> > > mode allocation request that invalidates the busy block during the
> > > search, but does not ultimately allocate it. Due to the busy state
> > > invalidation, the block is no longer considered busy to subsequent
> > > allocation. A direct I/O write request immediately allocates the
> > > block and writes to it.
> >
> > I really hope there's a fstest case coming for this... :)
> >
>
> generic/019? :) I'm not sure of a good way to reproduce on demand given
> the conditions required to reproduce.
>
> > > Finally, the filesystem crashes while in a
> > > state where the initial metadata block free had not committed to the
> > > on-disk log. After recovery, the original metadata block is in its
> > > original location as expected, but has been corrupted by the
> > > aforementioned dio.
> >
> > Wheee!
> >
> > Looking at xfs_alloc_ag_vextent_exact, I guess the allocator will go
> > find a freespace record, call xfs_extent_busy_trim (which could erase
> > the busy extent entry), decide that it's not interested after all, and
> > bail out without restoring the busy entry.
> >
> > Similarly, xfs_alloc_cur_check calls _busy_trim (same side effects) as
> > we wander around the free space btrees looking for a good chunk of
> > space... and doesn't restore the busy record if it decides to consider a
> > different extent.
> >
>
> Yep. I was originally curious whether the more recent allocator rework
> introduced this problem somehow, but AFAICT that just refactored the
> relevant allocator code and this bug has been latent in the existing
> code for quite some time. That's not hugely surprising given the rare
> combination of conditions required to reproduce.
>
> > So I guess this "speculatively remove busy records and forget to restore
> > them" behavior opens the door to the write allocating blocks that aren't
> > yet free and nonbusy, right?  And the solution presented here is to
> > avoid letting go of the busy record for the bmbt allocation, and if the
> > btree split caller decides it really /must/ have that block for the bmbt
> > it can force the log and try again, just like we do for a file data
> > allocation?
> >
>
> Yes, pretty much. The metadata allocation that is allowed to safely
> reuse busy extents ends up invalidating a set of blocks during a NEAR
> mode search (i.e. bmbt allocation), but ends up only using one of those
> blocks. A data allocation immediately comes along next, finds one of the
> other invalidated blocks and writes to it. A crash/recovery leaves the
> invalidated busy block in its original metadata location having already
> been written to by the dio.
>
> > Another solution could have been to restore the record if we decide not
> > to go ahead with the allocation, but as we haven't yet committed to
> > using the space, there's no sense in thrashing the busy records?
> >
>
> That was my original thought as well. Then after looking through the
> code a bit I thought that something like allowing the allocator to
> "track" a reusable, but still busy extent until allocation is imminent
> might be a bit more straightforward of an implementation given the
> layering between the allocator and busy extent tracking code. IOW, we'd
> split the busy trim/available and busy invalidate logic into two steps
> instead of doing it immediately in the busy trim path. That would allow
> the allocator to consider the same set of reusable busy blocks but not
> commit to any of them until the allocation search is complete.
>
> However, either of those options require a bit of thought and rework
> (and perhaps some value proposition justification for the complexity)
> while the current trim reuse code is pretty much bolted on and broken.
> Therefore, I think it's appropriate to fix the bug in one step and
> follow up with a different implementation separately.
>

Hi Brian,

This patch was one of my selected fixes to backport for v5.10.y.
It has a very scary looking commit message and the change seems
to be independent of any infrastructure changes(?).

The problem is that applying this patch to v5.10.y reliably reproduces
this buffer corruption assertion [*] with test xfs/076.

This happens on the kdevops system that is using loop devices over
sparse files inside qemu images. It does not reproduce on my small
VM at home.

Normally, I would just drop this patch from the stable candidates queue
and move on, but I thought you might be interested to investigate this
reliable reproducer, because maybe this system exercises an error
that is otherwise rare to hit.

It seemed weird to me that NOT reusing the extent would result in
data corruption, but it could indicate that reusing the extent was masking
the assertion and hiding another bug(?).

Can you think of another reason to explain the regression this fix
introduces to 5.10.y?

Do you care to investigate this failure or shall I just move on?

Thanks,
Amir.

[*]
: XFS (loop5): Internal error xfs_trans_cancel at line 954 of file
fs/xfs/xfs_trans.c.  Caller xfs_create+0x22f/0x590 [xfs]
: CPU: 3 PID: 25481 Comm: touch Kdump: loaded Tainted: G            E
   5.10.109-xfs-2 #8
: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
04/01/2014
: Call Trace:
:  dump_stack+0x6d/0x88
:  xfs_trans_cancel+0x17b/0x1a0 [xfs]
:  xfs_create+0x22f/0x590 [xfs]
:  xfs_generic_create+0x245/0x310 [xfs]
:  ? d_splice_alias+0x13a/0x3c0
:  path_openat+0xe3f/0x1080
:  do_filp_open+0x93/0x100
:  ? handle_mm_fault+0x148e/0x1690
:  ? __check_object_size+0x162/0x180
:  do_sys_openat2+0x228/0x2d0
:  do_sys_open+0x4b/0x80
:  do_syscall_64+0x33/0x80
:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
: RIP: 0033:0x7f36b02eff1e
: Code: 25 00 00 41 00 3d 00 00 41 00 74 48 48 8d 05 e9 57 0d 00 8b 00
85 c0 75 69 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d
00 f0 ff ff 0
: RSP: 002b:00007ffe7ef6ca10 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
: RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f36b02eff1e
: RDX: 0000000000000941 RSI: 00007ffe7ef6ebfa RDI: 00000000ffffff9c
: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
: R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000002
: R13: 00007ffe7ef6ebfa R14: 0000000000000001 R15: 0000000000000001
: XFS (loop5): xfs_do_force_shutdown(0x8) called from line 955 of file
fs/xfs/xfs_trans.c. Return address = ffffffffc08f5764
: XFS (loop5): Corruption of in-memory data detected.  Shutting down filesystem
: XFS (loop5): Please unmount the filesystem and rectify the problem(s)

Brian Foster May 26, 2022, 2:21 p.m. UTC | #8

On Thu, May 26, 2022 at 02:34:38PM +0300, Amir Goldstein wrote:
> On Tue, Feb 23, 2021 at 2:35 PM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote:
> > > On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote:
> > > > Freed extents are marked busy from the point the freeing transaction
> > > > commits until the associated CIL context is checkpointed to the log.
> > > > This prevents reuse and overwrite of recently freed blocks before
> > > > the changes are committed to disk, which can lead to corruption
> > > > after a crash. The exception to this rule is that metadata
> > > > allocation is allowed to reuse busy extents because metadata changes
> > > > are also logged.
> > > >
> > > > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS
> > > > has allowed modification or complete invalidation of outstanding
> > > > busy extents for metadata allocations. This implementation assumes
> > > > that use of the associated extent is imminent, which is not always
> > > > the case. For example, the trimmed extent might not satisfy the
> > > > minimum length of the allocation request, or the allocation
> > > > algorithm might be involved in a search for the optimal result based
> > > > on locality.
> > > >
> > > > generic/019 reproduces a corruption caused by this scenario. First,
> > > > a metadata block (usually a bmbt or symlink block) is freed from an
> > > > inode. A subsequent bmbt split on an unrelated inode attempts a near
> > > > mode allocation request that invalidates the busy block during the
> > > > search, but does not ultimately allocate it. Due to the busy state
> > > > invalidation, the block is no longer considered busy to subsequent
> > > > allocation. A direct I/O write request immediately allocates the
> > > > block and writes to it.
> > >
> > > I really hope there's a fstest case coming for this... :)
> > >
> >
> > generic/019? :) I'm not sure of a good way to reproduce on demand given
> > the conditions required to reproduce.
> >
> > > > Finally, the filesystem crashes while in a
> > > > state where the initial metadata block free had not committed to the
> > > > on-disk log. After recovery, the original metadata block is in its
> > > > original location as expected, but has been corrupted by the
> > > > aforementioned dio.
> > >
> > > Wheee!
> > >
> > > Looking at xfs_alloc_ag_vextent_exact, I guess the allocator will go
> > > find a freespace record, call xfs_extent_busy_trim (which could erase
> > > the busy extent entry), decide that it's not interested after all, and
> > > bail out without restoring the busy entry.
> > >
> > > Similarly, xfs_alloc_cur_check calls _busy_trim (same side effects) as
> > > we wander around the free space btrees looking for a good chunk of
> > > space... and doesn't restore the busy record if it decides to consider a
> > > different extent.
> > >
> >
> > Yep. I was originally curious whether the more recent allocator rework
> > introduced this problem somehow, but AFAICT that just refactored the
> > relevant allocator code and this bug has been latent in the existing
> > code for quite some time. That's not hugely surprising given the rare
> > combination of conditions required to reproduce.
> >
> > > So I guess this "speculatively remove busy records and forget to restore
> > > them" behavior opens the door to the write allocating blocks that aren't
> > > yet free and nonbusy, right?  And the solution presented here is to
> > > avoid letting go of the busy record for the bmbt allocation, and if the
> > > btree split caller decides it really /must/ have that block for the bmbt
> > > it can force the log and try again, just like we do for a file data
> > > allocation?
> > >
> >
> > Yes, pretty much. The metadata allocation that is allowed to safely
> > reuse busy extents ends up invalidating a set of blocks during a NEAR
> > mode search (i.e. bmbt allocation), but ends up only using one of those
> > blocks. A data allocation immediately comes along next, finds one of the
> > other invalidated blocks and writes to it. A crash/recovery leaves the
> > invalidated busy block in its original metadata location having already
> > been written to by the dio.
> >
> > > Another solution could have been to restore the record if we decide not
> > > to go ahead with the allocation, but as we haven't yet committed to
> > > using the space, there's no sense in thrashing the busy records?
> > >
> >
> > That was my original thought as well. Then after looking through the
> > code a bit I thought that something like allowing the allocator to
> > "track" a reusable, but still busy extent until allocation is imminent
> > might be a bit more straightforward of an implementation given the
> > layering between the allocator and busy extent tracking code. IOW, we'd
> > split the busy trim/available and busy invalidate logic into two steps
> > instead of doing it immediately in the busy trim path. That would allow
> > the allocator to consider the same set of reusable busy blocks but not
> > commit to any of them until the allocation search is complete.
> >
> > However, either of those options require a bit of thought and rework
> > (and perhaps some value proposition justification for the complexity)
> > while the current trim reuse code is pretty much bolted on and broken.
> > Therefore, I think it's appropriate to fix the bug in one step and
> > follow up with a different implementation separately.
> >
> 
> Hi Brian,
> 
> This patch was one of my selected fixes to backport for v5.10.y.
> It has a very scary looking commit message and the change seems
> to be independent of any infrastructure changes(?).
> 
> The problem is that applying this patch to v5.10.y reliably reproduces
> this buffer corruption assertion [*] with test xfs/076.
> 
> This happens on the kdevops system that is using loop devices over
> sparse files inside qemu images. It does not reproduce on my small
> VM at home.
> 
> Normally, I would just drop this patch from the stable candidates queue
> and move on, but I thought you might be interested to investigate this
> reliable reproducer, because maybe this system exercises an error
> that is otherwise rare to hit.
> 
> It seemed weird to me that NOT reusing the extent would result in
> data corruption, but it could indicate that reusing the extent was masking
> the assertion and hiding another bug(?).
> 

Indeed, this does seem like an odd failure. The shutdown on transaction
cancel implies cancellation of a dirty transaction. This is not
necessarily corruption as much as just being the generic
naming/messaging related to shutdowns due to unexpected in-core state.
The patch in question removes some modifications to in-core busy extent
state during extent allocation that are fundamentally unsafe in
combination with how allocation works. This change doesn't appear to
affect any transaction directly, so the correlation may be indirect.

xfs/076 looks like it's a sparse inode allocation test, which certainly
seems relevant in that it is stressing the ability to allocate inode
chunks under free space fragmentation. If this patch further restricts
extent allocation by removing availability of some set of (recently
freed, busy) extents, then perhaps there is some allocation failure
sequence that was previously unlikely enough to mask some poor error
handling logic or transaction handling (like an agfl fixup dirtying a
transaction followed by an allocation failure, for example) that we're
now running into.

> Can you think of another reason to explain the regression this fix
> introduces to 5.10.y?
> 

Not off the top of my head. Something along the lines of the above seems
plausible, but that's just speculation at this point.

> Do you care to investigate this failure or shall I just move on?
> 

I think it would be good to understand whether there's a regression
introduced by this patch, a bug somewhere else or just some impedence
mismatch in logic between the combination of this change and whatever
else happens to be in v5.10.y. Unfortunately I'm not able to reproduce
if I pull just this commit back into latest 5.10.y (5.10.118). I've
tried with a traditional bdev as well as a preallocated and sparse
loopback scratch dev. Have you tested this patch (backport) in isolation
in your reproducer env or only in combination with other pending
backports?

Brian

> Thanks,
> Amir.
> 
> [*]
> : XFS (loop5): Internal error xfs_trans_cancel at line 954 of file
> fs/xfs/xfs_trans.c.  Caller xfs_create+0x22f/0x590 [xfs]
> : CPU: 3 PID: 25481 Comm: touch Kdump: loaded Tainted: G            E
>    5.10.109-xfs-2 #8
> : Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
> 04/01/2014
> : Call Trace:
> :  dump_stack+0x6d/0x88
> :  xfs_trans_cancel+0x17b/0x1a0 [xfs]
> :  xfs_create+0x22f/0x590 [xfs]
> :  xfs_generic_create+0x245/0x310 [xfs]
> :  ? d_splice_alias+0x13a/0x3c0
> :  path_openat+0xe3f/0x1080
> :  do_filp_open+0x93/0x100
> :  ? handle_mm_fault+0x148e/0x1690
> :  ? __check_object_size+0x162/0x180
> :  do_sys_openat2+0x228/0x2d0
> :  do_sys_open+0x4b/0x80
> :  do_syscall_64+0x33/0x80
> :  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> : RIP: 0033:0x7f36b02eff1e
> : Code: 25 00 00 41 00 3d 00 00 41 00 74 48 48 8d 05 e9 57 0d 00 8b 00
> 85 c0 75 69 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d
> 00 f0 ff ff 0
> : RSP: 002b:00007ffe7ef6ca10 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
> : RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f36b02eff1e
> : RDX: 0000000000000941 RSI: 00007ffe7ef6ebfa RDI: 00000000ffffff9c
> : RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> : R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000002
> : R13: 00007ffe7ef6ebfa R14: 0000000000000001 R15: 0000000000000001
> : XFS (loop5): xfs_do_force_shutdown(0x8) called from line 955 of file
> fs/xfs/xfs_trans.c. Return address = ffffffffc08f5764
> : XFS (loop5): Corruption of in-memory data detected.  Shutting down filesystem
> : XFS (loop5): Please unmount the filesystem and rectify the problem(s)
>

Amir Goldstein May 26, 2022, 3:28 p.m. UTC | #9

> > Hi Brian,
> >
> > This patch was one of my selected fixes to backport for v5.10.y.
> > It has a very scary looking commit message and the change seems
> > to be independent of any infrastructure changes(?).
> >
> > The problem is that applying this patch to v5.10.y reliably reproduces
> > this buffer corruption assertion [*] with test xfs/076.
> >
> > This happens on the kdevops system that is using loop devices over
> > sparse files inside qemu images. It does not reproduce on my small
> > VM at home.
> >
> > Normally, I would just drop this patch from the stable candidates queue
> > and move on, but I thought you might be interested to investigate this
> > reliable reproducer, because maybe this system exercises an error
> > that is otherwise rare to hit.
> >
> > It seemed weird to me that NOT reusing the extent would result in
> > data corruption, but it could indicate that reusing the extent was masking
> > the assertion and hiding another bug(?).
> >
>
> Indeed, this does seem like an odd failure. The shutdown on transaction
> cancel implies cancellation of a dirty transaction. This is not
> necessarily corruption as much as just being the generic
> naming/messaging related to shutdowns due to unexpected in-core state.
> The patch in question removes some modifications to in-core busy extent
> state during extent allocation that are fundamentally unsafe in
> combination with how allocation works. This change doesn't appear to
> affect any transaction directly, so the correlation may be indirect.
>
> xfs/076 looks like it's a sparse inode allocation test, which certainly
> seems relevant in that it is stressing the ability to allocate inode
> chunks under free space fragmentation. If this patch further restricts
> extent allocation by removing availability of some set of (recently
> freed, busy) extents, then perhaps there is some allocation failure
> sequence that was previously unlikely enough to mask some poor error
> handling logic or transaction handling (like an agfl fixup dirtying a
> transaction followed by an allocation failure, for example) that we're
> now running into.
>
> > Can you think of another reason to explain the regression this fix
> > introduces to 5.10.y?
> >
>
> Not off the top of my head. Something along the lines of the above seems
> plausible, but that's just speculation at this point.
>
> > Do you care to investigate this failure or shall I just move on?
> >
>
> I think it would be good to understand whether there's a regression
> introduced by this patch, a bug somewhere else or just some impedence
> mismatch in logic between the combination of this change and whatever
> else happens to be in v5.10.y. Unfortunately I'm not able to reproduce
> if I pull just this commit back into latest 5.10.y (5.10.118). I've
> tried with a traditional bdev as well as a preallocated and sparse
> loopback scratch dev.

I also failed to reproduce it on another VM, but it reproduces reliably
on this system. That's why I thought we'd better use this opportunity.
This system has lots of RAM and disk to spare so I have no problem
running this test in a VM in parallel to my work.

It is not actually my system, it's a system that Luis has setup for
stable XFS testing and gave me access to, so if the need arises
you could get direct access to the system, but for now, I have no
problem running the test for you.

> Have you tested this patch (backport) in isolation
> in your reproducer env or only in combination with other pending
> backports?
>

I tested it on top of 5.10.109 + these 5 patches:
https://github.com/amir73il/linux/commits/xfs-5.10.y-1

I can test it in isolation if you like. Let me know if there are
other forensics that you would like me to collect.

Thanks,
Amir.

Amir Goldstein May 26, 2022, 7:33 p.m. UTC | #10

On Thu, May 26, 2022 at 6:28 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > Hi Brian,
> > >
> > > This patch was one of my selected fixes to backport for v5.10.y.
> > > It has a very scary looking commit message and the change seems
> > > to be independent of any infrastructure changes(?).
> > >
> > > The problem is that applying this patch to v5.10.y reliably reproduces
> > > this buffer corruption assertion [*] with test xfs/076.
> > >
> > > This happens on the kdevops system that is using loop devices over
> > > sparse files inside qemu images. It does not reproduce on my small
> > > VM at home.
> > >
> > > Normally, I would just drop this patch from the stable candidates queue
> > > and move on, but I thought you might be interested to investigate this
> > > reliable reproducer, because maybe this system exercises an error
> > > that is otherwise rare to hit.
> > >
> > > It seemed weird to me that NOT reusing the extent would result in
> > > data corruption, but it could indicate that reusing the extent was masking
> > > the assertion and hiding another bug(?).
> > >
> >
> > Indeed, this does seem like an odd failure. The shutdown on transaction
> > cancel implies cancellation of a dirty transaction. This is not
> > necessarily corruption as much as just being the generic
> > naming/messaging related to shutdowns due to unexpected in-core state.
> > The patch in question removes some modifications to in-core busy extent
> > state during extent allocation that are fundamentally unsafe in
> > combination with how allocation works. This change doesn't appear to
> > affect any transaction directly, so the correlation may be indirect.
> >
> > xfs/076 looks like it's a sparse inode allocation test, which certainly
> > seems relevant in that it is stressing the ability to allocate inode
> > chunks under free space fragmentation. If this patch further restricts
> > extent allocation by removing availability of some set of (recently
> > freed, busy) extents, then perhaps there is some allocation failure
> > sequence that was previously unlikely enough to mask some poor error
> > handling logic or transaction handling (like an agfl fixup dirtying a
> > transaction followed by an allocation failure, for example) that we're
> > now running into.
> >
> > > Can you think of another reason to explain the regression this fix
> > > introduces to 5.10.y?
> > >
> >
> > Not off the top of my head. Something along the lines of the above seems
> > plausible, but that's just speculation at this point.
> >
> > > Do you care to investigate this failure or shall I just move on?
> > >
> >
> > I think it would be good to understand whether there's a regression
> > introduced by this patch, a bug somewhere else or just some impedence
> > mismatch in logic between the combination of this change and whatever
> > else happens to be in v5.10.y. Unfortunately I'm not able to reproduce
> > if I pull just this commit back into latest 5.10.y (5.10.118). I've
> > tried with a traditional bdev as well as a preallocated and sparse
> > loopback scratch dev.
>
> I also failed to reproduce it on another VM, but it reproduces reliably
> on this system. That's why I thought we'd better use this opportunity.
> This system has lots of RAM and disk to spare so I have no problem
> running this test in a VM in parallel to my work.
>
> It is not actually my system, it's a system that Luis has setup for
> stable XFS testing and gave me access to, so if the need arises
> you could get direct access to the system, but for now, I have no
> problem running the test for you.
>
> > Have you tested this patch (backport) in isolation
> > in your reproducer env or only in combination with other pending
> > backports?
> >
>
> I tested it on top of 5.10.109 + these 5 patches:
> https://github.com/amir73il/linux/commits/xfs-5.10.y-1
>
> I can test it in isolation if you like. Let me know if there are
> other forensics that you would like me to collect.
>.

FWIW, the test fails with just that one patch over 5.10.109.

Thanks,
Amir.

Brian Foster May 26, 2022, 7:47 p.m. UTC | #11

On Thu, May 26, 2022 at 06:28:23PM +0300, Amir Goldstein wrote:
> > > Hi Brian,
> > >
> > > This patch was one of my selected fixes to backport for v5.10.y.
> > > It has a very scary looking commit message and the change seems
> > > to be independent of any infrastructure changes(?).
> > >
> > > The problem is that applying this patch to v5.10.y reliably reproduces
> > > this buffer corruption assertion [*] with test xfs/076.
> > >
> > > This happens on the kdevops system that is using loop devices over
> > > sparse files inside qemu images. It does not reproduce on my small
> > > VM at home.
> > >
> > > Normally, I would just drop this patch from the stable candidates queue
> > > and move on, but I thought you might be interested to investigate this
> > > reliable reproducer, because maybe this system exercises an error
> > > that is otherwise rare to hit.
> > >
> > > It seemed weird to me that NOT reusing the extent would result in
> > > data corruption, but it could indicate that reusing the extent was masking
> > > the assertion and hiding another bug(?).
> > >
> >
> > Indeed, this does seem like an odd failure. The shutdown on transaction
> > cancel implies cancellation of a dirty transaction. This is not
> > necessarily corruption as much as just being the generic
> > naming/messaging related to shutdowns due to unexpected in-core state.
> > The patch in question removes some modifications to in-core busy extent
> > state during extent allocation that are fundamentally unsafe in
> > combination with how allocation works. This change doesn't appear to
> > affect any transaction directly, so the correlation may be indirect.
> >
> > xfs/076 looks like it's a sparse inode allocation test, which certainly
> > seems relevant in that it is stressing the ability to allocate inode
> > chunks under free space fragmentation. If this patch further restricts
> > extent allocation by removing availability of some set of (recently
> > freed, busy) extents, then perhaps there is some allocation failure
> > sequence that was previously unlikely enough to mask some poor error
> > handling logic or transaction handling (like an agfl fixup dirtying a
> > transaction followed by an allocation failure, for example) that we're
> > now running into.
> >
> > > Can you think of another reason to explain the regression this fix
> > > introduces to 5.10.y?
> > >
> >
> > Not off the top of my head. Something along the lines of the above seems
> > plausible, but that's just speculation at this point.
> >
> > > Do you care to investigate this failure or shall I just move on?
> > >
> >
> > I think it would be good to understand whether there's a regression
> > introduced by this patch, a bug somewhere else or just some impedence
> > mismatch in logic between the combination of this change and whatever
> > else happens to be in v5.10.y. Unfortunately I'm not able to reproduce
> > if I pull just this commit back into latest 5.10.y (5.10.118). I've
> > tried with a traditional bdev as well as a preallocated and sparse
> > loopback scratch dev.
> 
> I also failed to reproduce it on another VM, but it reproduces reliably
> on this system. That's why I thought we'd better use this opportunity.
> This system has lots of RAM and disk to spare so I have no problem
> running this test in a VM in parallel to my work.
> 
> It is not actually my system, it's a system that Luis has setup for
> stable XFS testing and gave me access to, so if the need arises
> you could get direct access to the system, but for now, I have no
> problem running the test for you.
> 
> > Have you tested this patch (backport) in isolation
> > in your reproducer env or only in combination with other pending
> > backports?
> >
> 
> I tested it on top of 5.10.109 + these 5 patches:
> https://github.com/amir73il/linux/commits/xfs-5.10.y-1
> 
> I can test it in isolation if you like. Let me know if there are
> other forensics that you would like me to collect.
> 

Hm. Still no luck if I move to .109 and pull in those few patches. I
assume there's nothing else potentially interesting about the test env
other than the sparse file scratch dev (i.e., default mkfs options,
etc.)? If so and you can reliably reproduce, I suppose it couldn't hurt
to try and grab a tracepoint dump of the test when it fails (feel free
to send directly or upload somewhere as the list may punt it, and please
also include the dmesg output that goes along with it) and I can see if
that shows anything helpful.

I think what we want to know initially is what error code we're
producing (-ENOSPC?) and where it originates, and from there we can
probably work out how the transaction might be dirty. I'm not sure a
trace dump will express that conclusively. If you wanted to increase the
odds of getting some useful information it might be helpful to stick a
few trace_printk() calls in the various trans cancel error paths out of
xfs_create() to determine whether it's the inode allocation attempt that
fails or the subsequent attempt to create the directory entry..

Brian

> Thanks,
> Amir.
>

Amir Goldstein May 26, 2022, 8:56 p.m. UTC | #12

> > I tested it on top of 5.10.109 + these 5 patches:
> > https://github.com/amir73il/linux/commits/xfs-5.10.y-1
> >
> > I can test it in isolation if you like. Let me know if there are
> > other forensics that you would like me to collect.
> >
>
> Hm. Still no luck if I move to .109 and pull in those few patches. I
> assume there's nothing else potentially interesting about the test env
> other than the sparse file scratch dev (i.e., default mkfs options,

Oh! right, this guest is debian/10 with xfsprogs 4.20, so the defaults
are reflink=0.

Actually, the section I am running is reflink_normapbt, but...

** mkfs failed with extra mkfs options added to "-f -m
reflink=1,rmapbt=0, -i sparse=1," by test 076 **
** attempting to mkfs using only test 076 options: -m crc=1 -i sparse **
** mkfs failed with extra mkfs options added to "-f -m
reflink=1,rmapbt=0, -i sparse=1," by test 076 **
** attempting to mkfs using only test 076 options: -d size=50m -m
crc=1 -i sparse **

mkfs.xfs does not accept double sparse argument, so the
test falls back to mkfs defaults (+ sparse)

I checked and xfsprogs 5.3 behaves the same, I did not check newer
xfsprogs, but that seems like a test bug(?)

IWO, unless xfsprogs was changed to be more tolerable to repeating
arguments, then maybe nobody is testing xfs/076 with reflink=0 (?)

> etc.)? If so and you can reliably reproduce, I suppose it couldn't hurt
> to try and grab a tracepoint dump of the test when it fails (feel free
> to send directly or upload somewhere as the list may punt it, and please
> also include the dmesg output that goes along with it) and I can see if
> that shows anything helpful.
>
> I think what we want to know initially is what error code we're
> producing (-ENOSPC?) and where it originates, and from there we can
> probably work out how the transaction might be dirty. I'm not sure a
> trace dump will express that conclusively. If you wanted to increase the
> odds of getting some useful information it might be helpful to stick a
> few trace_printk() calls in the various trans cancel error paths out of
> xfs_create() to determine whether it's the inode allocation attempt that
> fails or the subsequent attempt to create the directory entry..
>

Well, the full output is filled with ENOSPC (also in a good run), so it's
probably that, but I will try to get to that failing stack, no need for all the
noisy traces. Signing off the day. hope I will get to it tomorrow.

Thanks,
Amir.

P.S: this is how 076.full ends if it makes any difference:

touch: cannot touch '/media/scratch/offset.21889024/63': No space left on device
touch: cannot touch '/media/scratch/offset.21823488/63': No space left on device
touch: cannot touch '/media/scratch/offset.21757952/63': No space left on device
touch: cannot touch '/media/scratch/offset.21692416/63': No space left on device
touch: cannot touch '/media/scratch/offset.21626880/63': No space left on device
touch: cannot touch '/media/scratch/offset.21561344/63': No space left on device
touch: cannot touch '/media/scratch/offset.21495808/63': No space left on device
touch: cannot touch '/media/scratch/offset.21430272/63': No space left on device
stat: Input/output error
fpunch failed

Amir Goldstein May 27, 2022, 7:06 a.m. UTC | #13

On Thu, May 26, 2022 at 11:56 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > > I tested it on top of 5.10.109 + these 5 patches:
> > > https://github.com/amir73il/linux/commits/xfs-5.10.y-1
> > >
> > > I can test it in isolation if you like. Let me know if there are
> > > other forensics that you would like me to collect.
> > >
> >
> > Hm. Still no luck if I move to .109 and pull in those few patches. I
> > assume there's nothing else potentially interesting about the test env
> > other than the sparse file scratch dev (i.e., default mkfs options,
>
> Oh! right, this guest is debian/10 with xfsprogs 4.20, so the defaults
> are reflink=0.
>
> Actually, the section I am running is reflink_normapbt, but...
>
> ** mkfs failed with extra mkfs options added to "-f -m
> reflink=1,rmapbt=0, -i sparse=1," by test 076 **
> ** attempting to mkfs using only test 076 options: -m crc=1 -i sparse **
> ** mkfs failed with extra mkfs options added to "-f -m
> reflink=1,rmapbt=0, -i sparse=1," by test 076 **
> ** attempting to mkfs using only test 076 options: -d size=50m -m
> crc=1 -i sparse **
>
> mkfs.xfs does not accept double sparse argument, so the
> test falls back to mkfs defaults (+ sparse)
>
> I checked and xfsprogs 5.3 behaves the same, I did not check newer
> xfsprogs, but that seems like a test bug(?)
>

xfsprogs 5.16 still behaves the same, meaning that xfs/076 and many many
other tests ignore the custom mkfs options for the specific sections.
That is a big test coverage issue!

> IWO, unless xfsprogs was changed to be more tolerable to repeating
> arguments, then maybe nobody is testing xfs/076 with reflink=0 (?)
>

Bingo!
Test passes 100 runs with debian/testing - xfsprogs v5.16

I shall try to amend the test to force reflink=0 to see what happens.
You should try it as well.

Thanks,
Amir.

Brian Foster May 27, 2022, 12:50 p.m. UTC | #14

On Fri, May 27, 2022 at 10:06:46AM +0300, Amir Goldstein wrote:
> On Thu, May 26, 2022 at 11:56 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > > > I tested it on top of 5.10.109 + these 5 patches:
> > > > https://github.com/amir73il/linux/commits/xfs-5.10.y-1
> > > >
> > > > I can test it in isolation if you like. Let me know if there are
> > > > other forensics that you would like me to collect.
> > > >
> > >
> > > Hm. Still no luck if I move to .109 and pull in those few patches. I
> > > assume there's nothing else potentially interesting about the test env
> > > other than the sparse file scratch dev (i.e., default mkfs options,
> >
> > Oh! right, this guest is debian/10 with xfsprogs 4.20, so the defaults
> > are reflink=0.
> >
> > Actually, the section I am running is reflink_normapbt, but...
> >
> > ** mkfs failed with extra mkfs options added to "-f -m
> > reflink=1,rmapbt=0, -i sparse=1," by test 076 **
> > ** attempting to mkfs using only test 076 options: -m crc=1 -i sparse **
> > ** mkfs failed with extra mkfs options added to "-f -m
> > reflink=1,rmapbt=0, -i sparse=1," by test 076 **
> > ** attempting to mkfs using only test 076 options: -d size=50m -m
> > crc=1 -i sparse **
> >
> > mkfs.xfs does not accept double sparse argument, so the
> > test falls back to mkfs defaults (+ sparse)
> >
> > I checked and xfsprogs 5.3 behaves the same, I did not check newer
> > xfsprogs, but that seems like a test bug(?)
> >
> 
> xfsprogs 5.16 still behaves the same, meaning that xfs/076 and many many
> other tests ignore the custom mkfs options for the specific sections.
> That is a big test coverage issue!
> 
> > IWO, unless xfsprogs was changed to be more tolerable to repeating
> > arguments, then maybe nobody is testing xfs/076 with reflink=0 (?)
> >
> 
> Bingo!
> Test passes 100 runs with debian/testing - xfsprogs v5.16
> 
> I shall try to amend the test to force reflink=0 to see what happens.
> You should try it as well.
> 

Interesting. If I set -mreflink=0 xfs/076 seems to do the right thing
and format the scratch device as expected, but I'm still not seeing a
failure on my system for whatever reason.

Brian

> Thanks,
> Amir.
>

Amir Goldstein May 27, 2022, 2:16 p.m. UTC | #15

On Fri, May 27, 2022 at 3:50 PM Brian Foster <bfoster@redhat.com> wrote:
>
> On Fri, May 27, 2022 at 10:06:46AM +0300, Amir Goldstein wrote:
> > On Thu, May 26, 2022 at 11:56 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > > > I tested it on top of 5.10.109 + these 5 patches:
> > > > > https://github.com/amir73il/linux/commits/xfs-5.10.y-1
> > > > >
> > > > > I can test it in isolation if you like. Let me know if there are
> > > > > other forensics that you would like me to collect.
> > > > >
> > > >
> > > > Hm. Still no luck if I move to .109 and pull in those few patches. I
> > > > assume there's nothing else potentially interesting about the test env
> > > > other than the sparse file scratch dev (i.e., default mkfs options,
> > >
> > > Oh! right, this guest is debian/10 with xfsprogs 4.20, so the defaults
> > > are reflink=0.
> > >
> > > Actually, the section I am running is reflink_normapbt, but...
> > >
> > > ** mkfs failed with extra mkfs options added to "-f -m
> > > reflink=1,rmapbt=0, -i sparse=1," by test 076 **
> > > ** attempting to mkfs using only test 076 options: -m crc=1 -i sparse **
> > > ** mkfs failed with extra mkfs options added to "-f -m
> > > reflink=1,rmapbt=0, -i sparse=1," by test 076 **
> > > ** attempting to mkfs using only test 076 options: -d size=50m -m
> > > crc=1 -i sparse **
> > >
> > > mkfs.xfs does not accept double sparse argument, so the
> > > test falls back to mkfs defaults (+ sparse)
> > >
> > > I checked and xfsprogs 5.3 behaves the same, I did not check newer
> > > xfsprogs, but that seems like a test bug(?)
> > >
> >
> > xfsprogs 5.16 still behaves the same, meaning that xfs/076 and many many
> > other tests ignore the custom mkfs options for the specific sections.
> > That is a big test coverage issue!
> >
> > > IWO, unless xfsprogs was changed to be more tolerable to repeating
> > > arguments, then maybe nobody is testing xfs/076 with reflink=0 (?)
> > >
> >
> > Bingo!
> > Test passes 100 runs with debian/testing - xfsprogs v5.16
> >
> > I shall try to amend the test to force reflink=0 to see what happens.
> > You should try it as well.
> >
>
> Interesting. If I set -mreflink=0 xfs/076 seems to do the right thing
> and format the scratch device as expected, but I'm still not seeing a
> failure on my system for whatever reason.

Me neither (on my home VM).
I will investigate the assertion on the system where it reproduces.
May take me some time but I will get to it.

Thanks,
Amir.

Amir Goldstein May 28, 2022, 2:23 p.m. UTC | #16

On Thu, May 26, 2022 at 10:47 PM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, May 26, 2022 at 06:28:23PM +0300, Amir Goldstein wrote:
> > > > Hi Brian,
> > > >
> > > > This patch was one of my selected fixes to backport for v5.10.y.
> > > > It has a very scary looking commit message and the change seems
> > > > to be independent of any infrastructure changes(?).
> > > >
> > > > The problem is that applying this patch to v5.10.y reliably reproduces
> > > > this buffer corruption assertion [*] with test xfs/076.
> > > >
> > > > This happens on the kdevops system that is using loop devices over
> > > > sparse files inside qemu images. It does not reproduce on my small
> > > > VM at home.
> > > >
> > > > Normally, I would just drop this patch from the stable candidates queue
> > > > and move on, but I thought you might be interested to investigate this
> > > > reliable reproducer, because maybe this system exercises an error
> > > > that is otherwise rare to hit.
> > > >
> > > > It seemed weird to me that NOT reusing the extent would result in
> > > > data corruption, but it could indicate that reusing the extent was masking
> > > > the assertion and hiding another bug(?).
> > > >
> > >
> > > Indeed, this does seem like an odd failure. The shutdown on transaction
> > > cancel implies cancellation of a dirty transaction. This is not
> > > necessarily corruption as much as just being the generic
> > > naming/messaging related to shutdowns due to unexpected in-core state.
> > > The patch in question removes some modifications to in-core busy extent
> > > state during extent allocation that are fundamentally unsafe in
> > > combination with how allocation works. This change doesn't appear to
> > > affect any transaction directly, so the correlation may be indirect.
> > >
> > > xfs/076 looks like it's a sparse inode allocation test, which certainly
> > > seems relevant in that it is stressing the ability to allocate inode
> > > chunks under free space fragmentation. If this patch further restricts
> > > extent allocation by removing availability of some set of (recently
> > > freed, busy) extents, then perhaps there is some allocation failure
> > > sequence that was previously unlikely enough to mask some poor error
> > > handling logic or transaction handling (like an agfl fixup dirtying a
> > > transaction followed by an allocation failure, for example) that we're
> > > now running into.
> > >
> > > > Can you think of another reason to explain the regression this fix
> > > > introduces to 5.10.y?
> > > >
> > >
> > > Not off the top of my head. Something along the lines of the above seems
> > > plausible, but that's just speculation at this point.
> > >
> > > > Do you care to investigate this failure or shall I just move on?
> > > >
> > >
> > > I think it would be good to understand whether there's a regression
> > > introduced by this patch, a bug somewhere else or just some impedence
> > > mismatch in logic between the combination of this change and whatever
> > > else happens to be in v5.10.y. Unfortunately I'm not able to reproduce
> > > if I pull just this commit back into latest 5.10.y (5.10.118). I've
> > > tried with a traditional bdev as well as a preallocated and sparse
> > > loopback scratch dev.
> >
> > I also failed to reproduce it on another VM, but it reproduces reliably
> > on this system. That's why I thought we'd better use this opportunity.
> > This system has lots of RAM and disk to spare so I have no problem
> > running this test in a VM in parallel to my work.
> >
> > It is not actually my system, it's a system that Luis has setup for
> > stable XFS testing and gave me access to, so if the need arises
> > you could get direct access to the system, but for now, I have no
> > problem running the test for you.
> >
> > > Have you tested this patch (backport) in isolation
> > > in your reproducer env or only in combination with other pending
> > > backports?
> > >
> >
> > I tested it on top of 5.10.109 + these 5 patches:
> > https://github.com/amir73il/linux/commits/xfs-5.10.y-1
> >
> > I can test it in isolation if you like. Let me know if there are
> > other forensics that you would like me to collect.
> >
>
> Hm. Still no luck if I move to .109 and pull in those few patches. I
> assume there's nothing else potentially interesting about the test env
> other than the sparse file scratch dev (i.e., default mkfs options,
> etc.)? If so and you can reliably reproduce, I suppose it couldn't hurt
> to try and grab a tracepoint dump of the test when it fails (feel free
> to send directly or upload somewhere as the list may punt it, and please
> also include the dmesg output that goes along with it) and I can see if
> that shows anything helpful.
>
> I think what we want to know initially is what error code we're
> producing (-ENOSPC?) and where it originates, and from there we can
> probably work out how the transaction might be dirty. I'm not sure a
> trace dump will express that conclusively. If you wanted to increase the
> odds of getting some useful information it might be helpful to stick a
> few trace_printk() calls in the various trans cancel error paths out of
> xfs_create() to determine whether it's the inode allocation attempt that
> fails or the subsequent attempt to create the directory entry..
>

The error (-ENOSPC) comes from this v5.10 code in xfs_dir_ialloc():

        if (!ialloc_context && !ip) {
                *ipp = NULL;
                return -ENOSPC;
        }

Which theoretically might trip after xfs_ialloc() has marked the transaction
dirty(?).

This specific code is gone with this cleanup series in v5.11:

https://lore.kernel.org/linux-xfs/20201209112820.114863-1-hsiangkao@redhat.com/

When the $SUBJECT patch is applied to v5.11.16 the test xfs/076 does not fail.

So either the $SUBJECT patch (from 5.12) is incompatible with v5.10 code
or the cleanup series somehow managed to make my system not reproduce
the bug anymore.

I will assume the former and drop this patch from v5.10.y candidates.
If you want me to continue to research the bug on v5.10 let me know
what else you want me to check.

Thanks,
Amir.

Brian Foster May 31, 2022, 3:55 p.m. UTC | #17

On Sat, May 28, 2022 at 05:23:19PM +0300, Amir Goldstein wrote:
> On Thu, May 26, 2022 at 10:47 PM Brian Foster <bfoster@redhat.com> wrote:
> >
> > On Thu, May 26, 2022 at 06:28:23PM +0300, Amir Goldstein wrote:
> > > > > Hi Brian,
> > > > >
> > > > > This patch was one of my selected fixes to backport for v5.10.y.
> > > > > It has a very scary looking commit message and the change seems
> > > > > to be independent of any infrastructure changes(?).
> > > > >
> > > > > The problem is that applying this patch to v5.10.y reliably reproduces
> > > > > this buffer corruption assertion [*] with test xfs/076.
> > > > >
> > > > > This happens on the kdevops system that is using loop devices over
> > > > > sparse files inside qemu images. It does not reproduce on my small
> > > > > VM at home.
> > > > >
> > > > > Normally, I would just drop this patch from the stable candidates queue
> > > > > and move on, but I thought you might be interested to investigate this
> > > > > reliable reproducer, because maybe this system exercises an error
> > > > > that is otherwise rare to hit.
> > > > >
> > > > > It seemed weird to me that NOT reusing the extent would result in
> > > > > data corruption, but it could indicate that reusing the extent was masking
> > > > > the assertion and hiding another bug(?).
> > > > >
> > > >
> > > > Indeed, this does seem like an odd failure. The shutdown on transaction
> > > > cancel implies cancellation of a dirty transaction. This is not
> > > > necessarily corruption as much as just being the generic
> > > > naming/messaging related to shutdowns due to unexpected in-core state.
> > > > The patch in question removes some modifications to in-core busy extent
> > > > state during extent allocation that are fundamentally unsafe in
> > > > combination with how allocation works. This change doesn't appear to
> > > > affect any transaction directly, so the correlation may be indirect.
> > > >
> > > > xfs/076 looks like it's a sparse inode allocation test, which certainly
> > > > seems relevant in that it is stressing the ability to allocate inode
> > > > chunks under free space fragmentation. If this patch further restricts
> > > > extent allocation by removing availability of some set of (recently
> > > > freed, busy) extents, then perhaps there is some allocation failure
> > > > sequence that was previously unlikely enough to mask some poor error
> > > > handling logic or transaction handling (like an agfl fixup dirtying a
> > > > transaction followed by an allocation failure, for example) that we're
> > > > now running into.
> > > >
> > > > > Can you think of another reason to explain the regression this fix
> > > > > introduces to 5.10.y?
> > > > >
> > > >
> > > > Not off the top of my head. Something along the lines of the above seems
> > > > plausible, but that's just speculation at this point.
> > > >
> > > > > Do you care to investigate this failure or shall I just move on?
> > > > >
> > > >
> > > > I think it would be good to understand whether there's a regression
> > > > introduced by this patch, a bug somewhere else or just some impedence
> > > > mismatch in logic between the combination of this change and whatever
> > > > else happens to be in v5.10.y. Unfortunately I'm not able to reproduce
> > > > if I pull just this commit back into latest 5.10.y (5.10.118). I've
> > > > tried with a traditional bdev as well as a preallocated and sparse
> > > > loopback scratch dev.
> > >
> > > I also failed to reproduce it on another VM, but it reproduces reliably
> > > on this system. That's why I thought we'd better use this opportunity.
> > > This system has lots of RAM and disk to spare so I have no problem
> > > running this test in a VM in parallel to my work.
> > >
> > > It is not actually my system, it's a system that Luis has setup for
> > > stable XFS testing and gave me access to, so if the need arises
> > > you could get direct access to the system, but for now, I have no
> > > problem running the test for you.
> > >
> > > > Have you tested this patch (backport) in isolation
> > > > in your reproducer env or only in combination with other pending
> > > > backports?
> > > >
> > >
> > > I tested it on top of 5.10.109 + these 5 patches:
> > > https://github.com/amir73il/linux/commits/xfs-5.10.y-1
> > >
> > > I can test it in isolation if you like. Let me know if there are
> > > other forensics that you would like me to collect.
> > >
> >
> > Hm. Still no luck if I move to .109 and pull in those few patches. I
> > assume there's nothing else potentially interesting about the test env
> > other than the sparse file scratch dev (i.e., default mkfs options,
> > etc.)? If so and you can reliably reproduce, I suppose it couldn't hurt
> > to try and grab a tracepoint dump of the test when it fails (feel free
> > to send directly or upload somewhere as the list may punt it, and please
> > also include the dmesg output that goes along with it) and I can see if
> > that shows anything helpful.
> >
> > I think what we want to know initially is what error code we're
> > producing (-ENOSPC?) and where it originates, and from there we can
> > probably work out how the transaction might be dirty. I'm not sure a
> > trace dump will express that conclusively. If you wanted to increase the
> > odds of getting some useful information it might be helpful to stick a
> > few trace_printk() calls in the various trans cancel error paths out of
> > xfs_create() to determine whether it's the inode allocation attempt that
> > fails or the subsequent attempt to create the directory entry..
> >
> 
> The error (-ENOSPC) comes from this v5.10 code in xfs_dir_ialloc():
> 
>         if (!ialloc_context && !ip) {
>                 *ipp = NULL;
>                 return -ENOSPC;
>         }
> 
> Which theoretically might trip after xfs_ialloc() has marked the transaction
> dirty(?).
> 

Yeah, I think the first part that might have dirtied the transaction at
this point is fixing up the AGFL on a block allocation attempt. Ideally
the AG selection code would prevent taking this step for an AG that
can't satisfy an allocation, but realistically I'm not sure this
approach will ever work perfectly unless it separates out the extent
search algorithm from the prospect of dirtying the transaction and thus
committing to the allocation. This may not be trivial because the AGFL
fixup can require extent allocation itself.

It's not terribly surprising that limiting reuse of busy extents could
increase the likelihood of allocation failure, though I think most cases
should flush busy extents and retry. I wonder a bit whether changes to
the near mode allocation algorithm may have introduced a potential
regression in that regard.

> This specific code is gone with this cleanup series in v5.11:
> 
> https://lore.kernel.org/linux-xfs/20201209112820.114863-1-hsiangkao@redhat.com/
> 
> When the $SUBJECT patch is applied to v5.11.16 the test xfs/076 does not fail.
> 
> So either the $SUBJECT patch (from 5.12) is incompatible with v5.10 code
> or the cleanup series somehow managed to make my system not reproduce
> the bug anymore.
> 

I thought this was mostly refactoring and reworking the ugly retry
pattern in the former code vs. major functional changes. I think the
fundamental prospect of allocation failure still exists in current code,
but clearly your tests demonstrate some practical difference. Perhaps
there are some subtle logic changes in that rework that help prevent
this problem.

> I will assume the former and drop this patch from v5.10.y candidates.
> If you want me to continue to research the bug on v5.10 let me know
> what else you want me to check.
> 

I think that's reasonable. Realistically the bug fixed by this patch is
so old and long standing (v3.0?) I don't think it's terribly important
to pull it from v5.12 to v5.10.y unless real users started hitting it.
My curiosity around the cause is moreso to identify whether there's a
bug in mainline that needs fixing..

Brian

> Thanks,
> Amir.
>

Gao Xiang Jan. 11, 2023, 2:41 p.m. UTC | #18

Hi,

On 2022/5/26 19:34, Amir Goldstein wrote:
> On Tue, Feb 23, 2021 at 2:35 PM Brian Foster <bfoster@redhat.com> wrote:
>>
>> On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote:
>>> On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote:
>>>> Freed extents are marked busy from the point the freeing transaction
>>>> commits until the associated CIL context is checkpointed to the log.
>>>> This prevents reuse and overwrite of recently freed blocks before
>>>> the changes are committed to disk, which can lead to corruption
>>>> after a crash. The exception to this rule is that metadata
>>>> allocation is allowed to reuse busy extents because metadata changes
>>>> are also logged.
>>>>
>>>> As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS
>>>> has allowed modification or complete invalidation of outstanding
>>>> busy extents for metadata allocations. This implementation assumes
>>>> that use of the associated extent is imminent, which is not always
>>>> the case. For example, the trimmed extent might not satisfy the
>>>> minimum length of the allocation request, or the allocation
>>>> algorithm might be involved in a search for the optimal result based
>>>> on locality.
>>>>
>>>> generic/019 reproduces a corruption caused by this scenario. First,
>>>> a metadata block (usually a bmbt or symlink block) is freed from an
>>>> inode. A subsequent bmbt split on an unrelated inode attempts a near
>>>> mode allocation request that invalidates the busy block during the
>>>> search, but does not ultimately allocate it. Due to the busy state
>>>> invalidation, the block is no longer considered busy to subsequent
>>>> allocation. A direct I/O write request immediately allocates the
>>>> block and writes to it.
>>>

...

>>
> 
> Hi Brian,
> 
> This patch was one of my selected fixes to backport for v5.10.y.
> It has a very scary looking commit message and the change seems
> to be independent of any infrastructure changes(?).
> 
> The problem is that applying this patch to v5.10.y reliably reproduces
> this buffer corruption assertion [*] with test xfs/076.
> 
> This happens on the kdevops system that is using loop devices over
> sparse files inside qemu images. It does not reproduce on my small
> VM at home.
> 
> Normally, I would just drop this patch from the stable candidates queue
> and move on, but I thought you might be interested to investigate this
> reliable reproducer, because maybe this system exercises an error
> that is otherwise rare to hit.
> 
> It seemed weird to me that NOT reusing the extent would result in
> data corruption, but it could indicate that reusing the extent was masking
> the assertion and hiding another bug(?).
> 
> Can you think of another reason to explain the regression this fix
> introduces to 5.10.y?
> 
> Do you care to investigate this failure or shall I just move on?
> 
> Thanks,
> Amir.
> 
> [*]
> : XFS (loop5): Internal error xfs_trans_cancel at line 954 of file
> fs/xfs/xfs_trans.c.  Caller xfs_create+0x22f/0x590 [xfs]
> : CPU: 3 PID: 25481 Comm: touch Kdump: loaded Tainted: G            E
>     5.10.109-xfs-2 #8
> : Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
> 04/01/2014
> : Call Trace:
> :  dump_stack+0x6d/0x88
> :  xfs_trans_cancel+0x17b/0x1a0 [xfs]
> :  xfs_create+0x22f/0x590 [xfs]
> :  xfs_generic_create+0x245/0x310 [xfs]
> :  ? d_splice_alias+0x13a/0x3c0
> :  path_openat+0xe3f/0x1080
> :  do_filp_open+0x93/0x100
> :  ? handle_mm_fault+0x148e/0x1690
> :  ? __check_object_size+0x162/0x180
> :  do_sys_openat2+0x228/0x2d0
> :  do_sys_open+0x4b/0x80
> :  do_syscall_64+0x33/0x80
> :  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> : RIP: 0033:0x7f36b02eff1e
> : Code: 25 00 00 41 00 3d 00 00 41 00 74 48 48 8d 05 e9 57 0d 00 8b 00
> 85 c0 75 69 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d
> 00 f0 ff ff 0
> : RSP: 002b:00007ffe7ef6ca10 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
> : RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f36b02eff1e
> : RDX: 0000000000000941 RSI: 00007ffe7ef6ebfa RDI: 00000000ffffff9c
> : RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> : R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000002
> : R13: 00007ffe7ef6ebfa R14: 0000000000000001 R15: 0000000000000001
> : XFS (loop5): xfs_do_force_shutdown(0x8) called from line 955 of file
> fs/xfs/xfs_trans.c. Return address = ffffffffc08f5764
> : XFS (loop5): Corruption of in-memory data detected.  Shutting down filesystem
> : XFS (loop5): Please unmount the filesystem and rectify the problem(s)
> 

(...just for the record) We also encountered this issue but without commit
  xfs: don't reuse busy extents on extent trim
applied in our 5.10 codebase...

Need to find some time to look into that...

[  413.283300] XFS (loop1): Internal error xfs_trans_cancel at line 950 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x219/0x590 [xfs]
[  413.284295] CPU: 0 PID: 27484 Comm: touch Tainted: G            E     5.10.134-13.an8.x86_64 #1
[  413.284296] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
[  413.284297] Call Trace:
[  413.284314]  dump_stack+0x57/0x6e
[  413.284373]  xfs_trans_cancel+0xa3/0x110 [xfs]
[  413.284412]  xfs_create+0x219/0x590 [xfs]
[  413.284458]  xfs_generic_create+0x21f/0x2d0 [xfs]
[  413.284462]  path_openat+0xdee/0x1020
[  413.284464]  do_filp_open+0x80/0xd0
[  413.284467]  ? __check_object_size+0x16a/0x180
[  413.284469]  do_sys_openat2+0x207/0x2c0
[  413.284471]  do_sys_open+0x3b/0x60
[  413.284475]  do_syscall_64+0x33/0x40
[  413.284478]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
[  413.284481] RIP: 0033:0x7fe623920252
[  413.284482] Code: 25 00 00 41 00 3d 00 00 41 00 74 4c 48 8d 05 55 43 2a 00 8b 00 85 c0 75 6d 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 0f 87 a2 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
[  413.284483] RSP: 002b:00007ffd7a38ca70 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
[  413.284485] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007fe623920252
[  413.284486] RDX: 0000000000000941 RSI: 00007ffd7a38d79d RDI: 00000000ffffff9c
[  413.284487] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[  413.284488] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000001
[  413.284488] R13: 0000000000000001 R14: 00007ffd7a38d79d R15: 00007fe623bbf374
[  413.289856] XFS (loop1): xfs_do_force_shutdown(0x8) called from line 951 of file fs/xfs/xfs_trans.c. Return address = 000000003fe0b8ba
[  413.289858] XFS (loop1): Corruption of in-memory data detected.  Shutting down filesystem
[  413.290573] XFS (loop1): Please unmount the filesystem and rectify the problem(s)

Thanks,
Gao Xiang

Dave Chinner Jan. 11, 2023, 9:06 p.m. UTC | #19

On Wed, Jan 11, 2023 at 10:41:43PM +0800, Gao Xiang wrote:
> Hi,
> 
> On 2022/5/26 19:34, Amir Goldstein wrote:
> > On Tue, Feb 23, 2021 at 2:35 PM Brian Foster <bfoster@redhat.com> wrote:
> > > 
> > > On Mon, Feb 22, 2021 at 10:27:45AM -0800, Darrick J. Wong wrote:
> > > > On Mon, Feb 22, 2021 at 10:34:42AM -0500, Brian Foster wrote:
> > > > > Freed extents are marked busy from the point the freeing transaction
> > > > > commits until the associated CIL context is checkpointed to the log.
> > > > > This prevents reuse and overwrite of recently freed blocks before
> > > > > the changes are committed to disk, which can lead to corruption
> > > > > after a crash. The exception to this rule is that metadata
> > > > > allocation is allowed to reuse busy extents because metadata changes
> > > > > are also logged.
> > > > > 
> > > > > As of commit 97d3ac75e5e0 ("xfs: exact busy extent tracking"), XFS
> > > > > has allowed modification or complete invalidation of outstanding
> > > > > busy extents for metadata allocations. This implementation assumes
> > > > > that use of the associated extent is imminent, which is not always
> > > > > the case. For example, the trimmed extent might not satisfy the
> > > > > minimum length of the allocation request, or the allocation
> > > > > algorithm might be involved in a search for the optimal result based
> > > > > on locality.
> > > > > 
> > > > > generic/019 reproduces a corruption caused by this scenario. First,
> > > > > a metadata block (usually a bmbt or symlink block) is freed from an
> > > > > inode. A subsequent bmbt split on an unrelated inode attempts a near
> > > > > mode allocation request that invalidates the busy block during the
> > > > > search, but does not ultimately allocate it. Due to the busy state
> > > > > invalidation, the block is no longer considered busy to subsequent
> > > > > allocation. A direct I/O write request immediately allocates the
> > > > > block and writes to it.
> > > > 
> 
> ...
> 
> > > 
> > 
> > Hi Brian,
> > 
> > This patch was one of my selected fixes to backport for v5.10.y.
> > It has a very scary looking commit message and the change seems
> > to be independent of any infrastructure changes(?).
> > 
> > The problem is that applying this patch to v5.10.y reliably reproduces
> > this buffer corruption assertion [*] with test xfs/076.
> > 
> > This happens on the kdevops system that is using loop devices over
> > sparse files inside qemu images. It does not reproduce on my small
> > VM at home.
> > 
> > Normally, I would just drop this patch from the stable candidates queue
> > and move on, but I thought you might be interested to investigate this
> > reliable reproducer, because maybe this system exercises an error
> > that is otherwise rare to hit.
> > 
> > It seemed weird to me that NOT reusing the extent would result in
> > data corruption, but it could indicate that reusing the extent was masking
> > the assertion and hiding another bug(?).
> > 
> > Can you think of another reason to explain the regression this fix
> > introduces to 5.10.y?
> > 
> > Do you care to investigate this failure or shall I just move on?
> > 
> > Thanks,
> > Amir.
> > 
> > [*]
> > : XFS (loop5): Internal error xfs_trans_cancel at line 954 of file
> > fs/xfs/xfs_trans.c.  Caller xfs_create+0x22f/0x590 [xfs]
> > : CPU: 3 PID: 25481 Comm: touch Kdump: loaded Tainted: G            E
> >     5.10.109-xfs-2 #8
> > : Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
> > 04/01/2014
> > : Call Trace:
> > :  dump_stack+0x6d/0x88
> > :  xfs_trans_cancel+0x17b/0x1a0 [xfs]
> > :  xfs_create+0x22f/0x590 [xfs]
> > :  xfs_generic_create+0x245/0x310 [xfs]
> > :  ? d_splice_alias+0x13a/0x3c0
> > :  path_openat+0xe3f/0x1080
> > :  do_filp_open+0x93/0x100
> > :  ? handle_mm_fault+0x148e/0x1690
> > :  ? __check_object_size+0x162/0x180
> > :  do_sys_openat2+0x228/0x2d0
> > :  do_sys_open+0x4b/0x80
> > :  do_syscall_64+0x33/0x80
> > :  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > : RIP: 0033:0x7f36b02eff1e
> > : Code: 25 00 00 41 00 3d 00 00 41 00 74 48 48 8d 05 e9 57 0d 00 8b 00
> > 85 c0 75 69 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d
> > 00 f0 ff ff 0
> > : RSP: 002b:00007ffe7ef6ca10 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
> > : RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f36b02eff1e
> > : RDX: 0000000000000941 RSI: 00007ffe7ef6ebfa RDI: 00000000ffffff9c
> > : RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> > : R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000002
> > : R13: 00007ffe7ef6ebfa R14: 0000000000000001 R15: 0000000000000001
> > : XFS (loop5): xfs_do_force_shutdown(0x8) called from line 955 of file
> > fs/xfs/xfs_trans.c. Return address = ffffffffc08f5764
> > : XFS (loop5): Corruption of in-memory data detected.  Shutting down filesystem
> > : XFS (loop5): Please unmount the filesystem and rectify the problem(s)
> > 
> 
> (...just for the record) We also encountered this issue but without commit
>  xfs: don't reuse busy extents on extent trim
> applied in our 5.10 codebase...
> 
> Need to find some time to look into that...
> 
> [  413.283300] XFS (loop1): Internal error xfs_trans_cancel at line 950 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x219/0x590 [xfs]
> [  413.284295] CPU: 0 PID: 27484 Comm: touch Tainted: G            E     5.10.134-13.an8.x86_64 #1
> [  413.284296] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 449e491 04/01/2014
> [  413.284297] Call Trace:
> [  413.284314]  dump_stack+0x57/0x6e
> [  413.284373]  xfs_trans_cancel+0xa3/0x110 [xfs]
> [  413.284412]  xfs_create+0x219/0x590 [xfs]
> [  413.284458]  xfs_generic_create+0x21f/0x2d0 [xfs]
> [  413.284462]  path_openat+0xdee/0x1020
> [  413.284464]  do_filp_open+0x80/0xd0
> [  413.284467]  ? __check_object_size+0x16a/0x180
> [  413.284469]  do_sys_openat2+0x207/0x2c0
> [  413.284471]  do_sys_open+0x3b/0x60
> [  413.284475]  do_syscall_64+0x33/0x40
> [  413.284478]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
> [  413.284481] RIP: 0033:0x7fe623920252
> [  413.284482] Code: 25 00 00 41 00 3d 00 00 41 00 74 4c 48 8d 05 55 43 2a 00 8b 00 85 c0 75 6d 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 0f 87 a2 00 00 00 48 8b 4c 24 28 64 48 33 0c 25
> [  413.284483] RSP: 002b:00007ffd7a38ca70 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
> [  413.284485] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007fe623920252
> [  413.284486] RDX: 0000000000000941 RSI: 00007ffd7a38d79d RDI: 00000000ffffff9c
> [  413.284487] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
> [  413.284488] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000001
> [  413.284488] R13: 0000000000000001 R14: 00007ffd7a38d79d R15: 00007fe623bbf374
> [  413.289856] XFS (loop1): xfs_do_force_shutdown(0x8) called from line 951 of file fs/xfs/xfs_trans.c. Return address = 000000003fe0b8ba
> [  413.289858] XFS (loop1): Corruption of in-memory data detected.  Shutting down filesystem
> [  413.290573] XFS (loop1): Please unmount the filesystem and rectify the problem(s)

This likely has nothing to do with the commit in this thread.

The shutdowns reported in this thread are *not metadata corruption*;
this typically occurs when ENOSPC has occurred during inode
creation after an AGF has been dirtied. i.e.
xfs_alloc_fix_freelist() has dirtied the selected AGF and AGFL doing
AGFL fixup, then not had enough space left to allocate the minlen
extent being requested. The allocation then cannot find space in any
other AG (because ENOSPC), and the inode allocation fails and the
transaction gets cancelled.

Cancelling a dirty transaction is a fatal error - we cannot recover
from it as the metadata state prior to the modification in the
transaction cannot be easily recovered. i.e. we cannot roll back the
changes that have been made, and hence we have to shut down the
filesystem. The reason for the shutdown is that the in-memory state
is not recoverable, hence it is considered corrupt. No actual
corruption has occurred, we're just in an unrecoverable situation
with partially modified metadata.

IOWs, this is a symptom of a buggy "can we allocate successfully in
this AG" check prior to attemptimg allocating and modifying
metadata, not that there is actual filesystem corruption occurring.

Cheers,

Dave.

xfs: don't reuse busy extents on extent trim

Commit Message

Comments

Patch