xfs: fix livelock in delayed allocation at ENOSPC

Message ID	20230421222440.2722482-1-david@fromorbit.com (mailing list archive)
State	Accepted
Headers	show Return-Path: <linux-xfs-owner@vger.kernel.org> From: Dave Chinner <david@fromorbit.com> To: linux-xfs@vger.kernel.org Cc: bfoster@redhat.com Subject: [PATCH] xfs: fix livelock in delayed allocation at ENOSPC Date: Sat, 22 Apr 2023 08:24:40 +1000 Message-Id: <20230421222440.2722482-1-david@fromorbit.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	xfs: fix livelock in delayed allocation at ENOSPC \| expand xfs: fix livelock in delayed allocation at ENOSPC

Dave Chinner April 21, 2023, 10:24 p.m. UTC

From: Dave Chinner <dchinner@redhat.com>

On a filesystem with a non-zero stripe unit and a large sequential
write, delayed allocation will set a minimum allocation length of
the stripe unit. If allocation fails because there are no extents
long enough for an aligned minlen allocation, it is supposed to
fall back to unaligned allocation which allows single block extents
to be allocated.

When the allocator code was rewritting in the 6.3 cycle, this
fallback was broken - the old code used args->fsbno as the both the
allocation target and the allocation result, the new code passes the
target as a separate parameter. The conversion didn't handle the
aligned->unaligned fallback path correctly - it reset args->fsbno to
the target fsbno on failure which broke allocation failure detection
in the high level code and so it never fell back to unaligned
allocations.

This resulted in a loop in writeback trying to allocate an aligned
block, getting a false positive success, trying to insert the result
in the BMBT. This did nothing because the extent already was in the
BMBT (merge results in an unchanged extent) and so it returned the
prior extent to the conversion code as the current iomap.

Because the iomap returned didn't cover the offset we tried to map,
xfs_convert_blocks() then retries the allocation, which fails in the
same way and now we have a livelock.

Reported-by: Brian Foster <bfoster@redhat.com>
Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 1 -
 1 file changed, 1 deletion(-)

Darrick J. Wong April 22, 2023, 3:53 a.m. UTC | #1

On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> On a filesystem with a non-zero stripe unit and a large sequential
> write, delayed allocation will set a minimum allocation length of
> the stripe unit. If allocation fails because there are no extents
> long enough for an aligned minlen allocation, it is supposed to
> fall back to unaligned allocation which allows single block extents
> to be allocated.
> 
> When the allocator code was rewritting in the 6.3 cycle, this
> fallback was broken - the old code used args->fsbno as the both the
> allocation target and the allocation result, the new code passes the
> target as a separate parameter. The conversion didn't handle the
> aligned->unaligned fallback path correctly - it reset args->fsbno to
> the target fsbno on failure which broke allocation failure detection
> in the high level code and so it never fell back to unaligned
> allocations.
> 
> This resulted in a loop in writeback trying to allocate an aligned
> block, getting a false positive success, trying to insert the result
> in the BMBT. This did nothing because the extent already was in the
> BMBT (merge results in an unchanged extent) and so it returned the
> prior extent to the conversion code as the current iomap.
> 
> Because the iomap returned didn't cover the offset we tried to map,
> xfs_convert_blocks() then retries the allocation, which fails in the
> same way and now we have a livelock.
> 
> Reported-by: Brian Foster <bfoster@redhat.com>
> Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Will give this one a spin through the test system over the weekend.

In the meantime, can one of you come up with a reproducer?  From the
description, it doesn't sound like that should be too hard -- mount with
no stripe unit set, fragment the free space, mount with a stripe unit
set, then run the fs out of space?

--D

> ---
>  fs/xfs/libxfs/xfs_bmap.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 1a4e446194dd..b512de0540d5 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3540,7 +3540,6 @@ xfs_bmap_btalloc_at_eof(
>  	 * original non-aligned state so the caller can proceed on allocation
>  	 * failure as if this function was never called.
>  	 */
> -	args->fsbno = ap->blkno;
>  	args->alignment = 1;
>  	return 0;
>  }
> -- 
> 2.39.2
>

Dave Chinner April 23, 2023, 10:14 p.m. UTC | #2

On Fri, Apr 21, 2023 at 08:53:00PM -0700, Darrick J. Wong wrote:
> On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > On a filesystem with a non-zero stripe unit and a large sequential
> > write, delayed allocation will set a minimum allocation length of
> > the stripe unit. If allocation fails because there are no extents
> > long enough for an aligned minlen allocation, it is supposed to
> > fall back to unaligned allocation which allows single block extents
> > to be allocated.
> > 
> > When the allocator code was rewritting in the 6.3 cycle, this
> > fallback was broken - the old code used args->fsbno as the both the
> > allocation target and the allocation result, the new code passes the
> > target as a separate parameter. The conversion didn't handle the
> > aligned->unaligned fallback path correctly - it reset args->fsbno to
> > the target fsbno on failure which broke allocation failure detection
> > in the high level code and so it never fell back to unaligned
> > allocations.
> > 
> > This resulted in a loop in writeback trying to allocate an aligned
> > block, getting a false positive success, trying to insert the result
> > in the BMBT. This did nothing because the extent already was in the
> > BMBT (merge results in an unchanged extent) and so it returned the
> > prior extent to the conversion code as the current iomap.
> > 
> > Because the iomap returned didn't cover the offset we tried to map,
> > xfs_convert_blocks() then retries the allocation, which fails in the
> > same way and now we have a livelock.
> > 
> > Reported-by: Brian Foster <bfoster@redhat.com>
> > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Will give this one a spin through the test system over the weekend.
> 
> In the meantime, can one of you come up with a reproducer?  From the
> description, it doesn't sound like that should be too hard -- mount with
> no stripe unit set, fragment the free space, mount with a stripe unit
> set, then run the fs out of space?

No need.

# ./run_check --run-opts "-s xfs_align -g enospc"
Running: MOUNT_OPTIONS= ./check -R xunit -b -s xfs_align -g enospc
SECTION       -- xfs_align
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 test2 6.3.0-rc6-dgc+ #1779 SMP PREEMPT_DYNAMIC Fri Apr 14 11:24:18 AEST 2023
MKFS_OPTIONS  -- -f -m rmapbt=1 -dsu=128k,sw=2 /dev/vdb
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/vdb /mnt/scratch

generic/015 1s ... 

Hangs immediately on the first ENOSPC test.

IOWs, up to this point, no ENOSPC testing had been done on stripe
aligned filesystems. A hole in the (ever expanding) test matrix we
need to run...

-Dave.

Brian Foster April 24, 2023, 2:27 p.m. UTC | #3

On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> On a filesystem with a non-zero stripe unit and a large sequential
> write, delayed allocation will set a minimum allocation length of
> the stripe unit. If allocation fails because there are no extents
> long enough for an aligned minlen allocation, it is supposed to
> fall back to unaligned allocation which allows single block extents
> to be allocated.
> 
> When the allocator code was rewritting in the 6.3 cycle, this
> fallback was broken - the old code used args->fsbno as the both the
> allocation target and the allocation result, the new code passes the
> target as a separate parameter. The conversion didn't handle the
> aligned->unaligned fallback path correctly - it reset args->fsbno to
> the target fsbno on failure which broke allocation failure detection
> in the high level code and so it never fell back to unaligned
> allocations.
> 
> This resulted in a loop in writeback trying to allocate an aligned
> block, getting a false positive success, trying to insert the result
> in the BMBT. This did nothing because the extent already was in the
> BMBT (merge results in an unchanged extent) and so it returned the
> prior extent to the conversion code as the current iomap.
> 
> Because the iomap returned didn't cover the offset we tried to map,
> xfs_convert_blocks() then retries the allocation, which fails in the
> same way and now we have a livelock.
> 
> Reported-by: Brian Foster <bfoster@redhat.com>
> Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Problem solved, thanks. FWIW:

Tested-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_bmap.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 1a4e446194dd..b512de0540d5 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3540,7 +3540,6 @@ xfs_bmap_btalloc_at_eof(
>  	 * original non-aligned state so the caller can proceed on allocation
>  	 * failure as if this function was never called.
>  	 */
> -	args->fsbno = ap->blkno;
>  	args->alignment = 1;
>  	return 0;
>  }
> -- 
> 2.39.2
>

Darrick J. Wong April 25, 2023, 3:20 p.m. UTC | #4

On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> On a filesystem with a non-zero stripe unit and a large sequential
> write, delayed allocation will set a minimum allocation length of
> the stripe unit. If allocation fails because there are no extents
> long enough for an aligned minlen allocation, it is supposed to
> fall back to unaligned allocation which allows single block extents
> to be allocated.
> 
> When the allocator code was rewritting in the 6.3 cycle, this
> fallback was broken - the old code used args->fsbno as the both the
> allocation target and the allocation result, the new code passes the
> target as a separate parameter. The conversion didn't handle the
> aligned->unaligned fallback path correctly - it reset args->fsbno to
> the target fsbno on failure which broke allocation failure detection
> in the high level code and so it never fell back to unaligned
> allocations.
> 
> This resulted in a loop in writeback trying to allocate an aligned
> block, getting a false positive success, trying to insert the result
> in the BMBT. This did nothing because the extent already was in the
> BMBT (merge results in an unchanged extent) and so it returned the
> prior extent to the conversion code as the current iomap.
> 
> Because the iomap returned didn't cover the offset we tried to map,
> xfs_convert_blocks() then retries the allocation, which fails in the
> same way and now we have a livelock.
> 
> Reported-by: Brian Foster <bfoster@redhat.com>
> Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Insofar as this has revealed a whole ton of *more* problems in mkfs,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Specifically: if I set su=128k,sw=4, some tests will try to format a
512M filesystem.  This results in an 8-AG filesystem with a log that
fills up almost but not all of an entire AG.  The AG then ends up with
an empty bnobt and an empty AGFL, and 25 missing blocks...

...oh and the new test vms that run this config failed to finish for
some reason.  Sigh.

--D

> ---
>  fs/xfs/libxfs/xfs_bmap.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 1a4e446194dd..b512de0540d5 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3540,7 +3540,6 @@ xfs_bmap_btalloc_at_eof(
>  	 * original non-aligned state so the caller can proceed on allocation
>  	 * failure as if this function was never called.
>  	 */
> -	args->fsbno = ap->blkno;
>  	args->alignment = 1;
>  	return 0;
>  }
> -- 
> 2.39.2
>

Dave Chinner April 26, 2023, 11:01 p.m. UTC | #5

On Tue, Apr 25, 2023 at 08:20:52AM -0700, Darrick J. Wong wrote:
> On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > On a filesystem with a non-zero stripe unit and a large sequential
> > write, delayed allocation will set a minimum allocation length of
> > the stripe unit. If allocation fails because there are no extents
> > long enough for an aligned minlen allocation, it is supposed to
> > fall back to unaligned allocation which allows single block extents
> > to be allocated.
> > 
> > When the allocator code was rewritting in the 6.3 cycle, this
> > fallback was broken - the old code used args->fsbno as the both the
> > allocation target and the allocation result, the new code passes the
> > target as a separate parameter. The conversion didn't handle the
> > aligned->unaligned fallback path correctly - it reset args->fsbno to
> > the target fsbno on failure which broke allocation failure detection
> > in the high level code and so it never fell back to unaligned
> > allocations.
> > 
> > This resulted in a loop in writeback trying to allocate an aligned
> > block, getting a false positive success, trying to insert the result
> > in the BMBT. This did nothing because the extent already was in the
> > BMBT (merge results in an unchanged extent) and so it returned the
> > prior extent to the conversion code as the current iomap.
> > 
> > Because the iomap returned didn't cover the offset we tried to map,
> > xfs_convert_blocks() then retries the allocation, which fails in the
> > same way and now we have a livelock.
> > 
> > Reported-by: Brian Foster <bfoster@redhat.com>
> > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Insofar as this has revealed a whole ton of *more* problems in mkfs,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks, I've added this to for-next and I'll include it in the pull
req to Linus tomorrow because I don't want expose everyone using
merge window kernels to this ENOSPC issue even for a short while.

> Specifically: if I set su=128k,sw=4, some tests will try to format a
> 512M filesystem.  This results in an 8-AG filesystem with a log that
> fills up almost but not all of an entire AG.  The AG then ends up with
> an empty bnobt and an empty AGFL, and 25 missing blocks...

I used su=64k,sw=2 so I didn't see those specific issues. Mostly I
see failures due to mkfs warnings like this:

    +Warning: AG size is a multiple of stripe width.  This can cause performance
    +problems by aligning all AGs on the same disk.  To avoid this, run mkfs with
    +an AG size that is one stripe unit smaller or larger, for example 129248.

> ...oh and the new test vms that run this config failed to finish for
> some reason.  Sigh.

Yeah, I've had xfs_repair hang in xfs/155 a couple of times. Killing
the xfs_repair process allows everything to keep going. I suspect
it's a prefetch race/deadlock...

-Dave.

Darrick J. Wong April 26, 2023, 11:38 p.m. UTC | #6

On Thu, Apr 27, 2023 at 09:01:35AM +1000, Dave Chinner wrote:
> On Tue, Apr 25, 2023 at 08:20:52AM -0700, Darrick J. Wong wrote:
> > On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > On a filesystem with a non-zero stripe unit and a large sequential
> > > write, delayed allocation will set a minimum allocation length of
> > > the stripe unit. If allocation fails because there are no extents
> > > long enough for an aligned minlen allocation, it is supposed to
> > > fall back to unaligned allocation which allows single block extents
> > > to be allocated.
> > > 
> > > When the allocator code was rewritting in the 6.3 cycle, this
> > > fallback was broken - the old code used args->fsbno as the both the
> > > allocation target and the allocation result, the new code passes the
> > > target as a separate parameter. The conversion didn't handle the
> > > aligned->unaligned fallback path correctly - it reset args->fsbno to
> > > the target fsbno on failure which broke allocation failure detection
> > > in the high level code and so it never fell back to unaligned
> > > allocations.
> > > 
> > > This resulted in a loop in writeback trying to allocate an aligned
> > > block, getting a false positive success, trying to insert the result
> > > in the BMBT. This did nothing because the extent already was in the
> > > BMBT (merge results in an unchanged extent) and so it returned the
> > > prior extent to the conversion code as the current iomap.
> > > 
> > > Because the iomap returned didn't cover the offset we tried to map,
> > > xfs_convert_blocks() then retries the allocation, which fails in the
> > > same way and now we have a livelock.
> > > 
> > > Reported-by: Brian Foster <bfoster@redhat.com>
> > > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > 
> > Insofar as this has revealed a whole ton of *more* problems in mkfs,
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> Thanks, I've added this to for-next and I'll include it in the pull
> req to Linus tomorrow because I don't want expose everyone using
> merge window kernels to this ENOSPC issue even for a short while.
> 
> > Specifically: if I set su=128k,sw=4, some tests will try to format a
> > 512M filesystem.  This results in an 8-AG filesystem with a log that
> > fills up almost but not all of an entire AG.  The AG then ends up with
> > an empty bnobt and an empty AGFL, and 25 missing blocks...
> 
> I used su=64k,sw=2 so I didn't see those specific issues. Mostly I
> see failures due to mkfs warnings like this:
> 
>     +Warning: AG size is a multiple of stripe width.  This can cause performance
>     +problems by aligning all AGs on the same disk.  To avoid this, run mkfs with
>     +an AG size that is one stripe unit smaller or larger, for example 129248.

Yeah, I noticed that one, and am testing a patch to quiet down mkfs a
little bit.

I also caught a bug in the AG formatting code where the bnobt gets
written out with zero records if the log happens to start beyond
m_ag_prealloc_size and end at EOAG.

I also noticed that the percpu inodegc workers occasionally run on the
wrong CPU, but only on arm.  Tonight I intend to test a fix for that...

...but I also have been tracking a fix for an issue where
xfs_inodegc_stop races with either the reclaim inodegc kicker or with an
already set-up delayed work timer, with the end result that
drain_workqueue sets WQ_DRAINING, someone (not the inodegc worker
itself) tries to queue_work the inodegc worker to the draining
workqueue, and we get a kernel bug message and the fs livelocks.

I've also been trying to fix that problem that Ritesh mentioned months
ago where if we manage to mount the fs cleanly but there are unlinked
inodes, we'll eventually fall over when the incore unlinked list fails
to find those lingering unlinked inodes.

I also added a su=128k,sw=4 config to the fstests fleet and am now
trying to fix all the fstests bugs that produce incorrect test failures.

> > ...oh and the new test vms that run this config failed to finish for
> > some reason.  Sigh.
> 
> Yeah, I've had xfs_repair hang in xfs/155 a couple of times. Killing
> the xfs_repair process allows everything to keep going. I suspect
> it's a prefetch race/deadlock...

<nod> I periodically catch xfs_repair deadlocked on an xfs_buf lock
where the pthread mutex says the lock is owned by a thread that is no
longer running.

--D

> -Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

Dave Chinner April 27, 2023, 12:11 a.m. UTC | #7

On Wed, Apr 26, 2023 at 04:38:31PM -0700, Darrick J. Wong wrote:
> On Thu, Apr 27, 2023 at 09:01:35AM +1000, Dave Chinner wrote:
> > On Tue, Apr 25, 2023 at 08:20:52AM -0700, Darrick J. Wong wrote:
> > > On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > On a filesystem with a non-zero stripe unit and a large sequential
> > > > write, delayed allocation will set a minimum allocation length of
> > > > the stripe unit. If allocation fails because there are no extents
> > > > long enough for an aligned minlen allocation, it is supposed to
> > > > fall back to unaligned allocation which allows single block extents
> > > > to be allocated.
> > > > 
> > > > When the allocator code was rewritting in the 6.3 cycle, this
> > > > fallback was broken - the old code used args->fsbno as the both the
> > > > allocation target and the allocation result, the new code passes the
> > > > target as a separate parameter. The conversion didn't handle the
> > > > aligned->unaligned fallback path correctly - it reset args->fsbno to
> > > > the target fsbno on failure which broke allocation failure detection
> > > > in the high level code and so it never fell back to unaligned
> > > > allocations.
> > > > 
> > > > This resulted in a loop in writeback trying to allocate an aligned
> > > > block, getting a false positive success, trying to insert the result
> > > > in the BMBT. This did nothing because the extent already was in the
> > > > BMBT (merge results in an unchanged extent) and so it returned the
> > > > prior extent to the conversion code as the current iomap.
> > > > 
> > > > Because the iomap returned didn't cover the offset we tried to map,
> > > > xfs_convert_blocks() then retries the allocation, which fails in the
> > > > same way and now we have a livelock.
> > > > 
> > > > Reported-by: Brian Foster <bfoster@redhat.com>
> > > > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
> > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Insofar as this has revealed a whole ton of *more* problems in mkfs,
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > 
> > Thanks, I've added this to for-next and I'll include it in the pull
> > req to Linus tomorrow because I don't want expose everyone using
> > merge window kernels to this ENOSPC issue even for a short while.
> > 
> > > Specifically: if I set su=128k,sw=4, some tests will try to format a
> > > 512M filesystem.  This results in an 8-AG filesystem with a log that
> > > fills up almost but not all of an entire AG.  The AG then ends up with
> > > an empty bnobt and an empty AGFL, and 25 missing blocks...
> > 
> > I used su=64k,sw=2 so I didn't see those specific issues. Mostly I
> > see failures due to mkfs warnings like this:
> > 
> >     +Warning: AG size is a multiple of stripe width.  This can cause performance
> >     +problems by aligning all AGs on the same disk.  To avoid this, run mkfs with
> >     +an AG size that is one stripe unit smaller or larger, for example 129248.
> 
> Yeah, I noticed that one, and am testing a patch to quiet down mkfs a
> little bit.
> 
> I also caught a bug in the AG formatting code where the bnobt gets
> written out with zero records if the log happens to start beyond
> m_ag_prealloc_size and end at EOAG.
> 
> I also noticed that the percpu inodegc workers occasionally run on the
> wrong CPU, but only on arm.  Tonight I intend to test a fix for that...

Whacky. :/

> ...but I also have been tracking a fix for an issue where
> xfs_inodegc_stop races with either the reclaim inodegc kicker or with an
> already set-up delayed work timer, with the end result that
> drain_workqueue sets WQ_DRAINING, someone (not the inodegc worker
> itself) tries to queue_work the inodegc worker to the draining
> workqueue, and we get a kernel bug message and the fs livelocks.

Yes, I've noticed that but not had time to fix it - disabling the
inodegc whilst stuff is still in progress after checking inodegc is
enabled is racy...

> I've also been trying to fix that problem that Ritesh mentioned months
> ago where if we manage to mount the fs cleanly but there are unlinked
> inodes, we'll eventually fall over when the incore unlinked list fails
> to find those lingering unlinked inodes.

Right, I also haven't had time to get to that either. IIRC it fails
to clear the bucket because we don't feed the error from the inodegc
context back to the recovery code.

> I also added a su=128k,sw=4 config to the fstests fleet and am now
> trying to fix all the fstests bugs that produce incorrect test failures.

The other thing I noticed is a couple of the FIEMAP tests fail
because they find data blocks where they expect holes such as:

generic/225 21s ... - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad)
    --- tests/generic/225.out   2022-12-21 15:53:25.479044361 +1100
    +++ /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad      2023-04-26 04:24:31.426016818 +1000
    @@ -1,3 +1,79 @@
     QA output created by 225
     fiemap run without preallocation, with sync
    +ERROR: FIEMAP claimed there was data at a block which should be a hole, and FIBMAP confirmend that it is in fact a hole, so FIEMAP is wrong: 35
    +ERROR: found an allocated extent where a hole should be: 35
    +map is 'DHDDHHDDHDDHHHHDDDDDHHHHHHHDHDDDHHDHDHHHHHDDHDDHHDDHDHHDDDHHHHDDDDHDHHDDHHHDDDDHHDHDDDHHDHDDDHDHHHHHDHDHDHDHHDDHDHHHHDHHDDDDDDDH'
    +logical: [      27..      27] phys:       67..      67 flags: 0x000 tot: 1
    +logical: [      29..      31] phys:       69..      71 flags: 0x000 tot: 3
    ...
    (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/225.out /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad'  to see the entire diff)

I haven't looked into this yet, but nothing is reporting data
corruptions so I suspect it's just the stripe aligned allocation
leaving unwritten extents in places the test is expecting holes to
exist...

> > > ...oh and the new test vms that run this config failed to finish for
> > > some reason.  Sigh.
> > 
> > Yeah, I've had xfs_repair hang in xfs/155 a couple of times. Killing
> > the xfs_repair process allows everything to keep going. I suspect
> > it's a prefetch race/deadlock...
> 
> <nod> I periodically catch xfs_repair deadlocked on an xfs_buf lock
> where the pthread mutex says the lock is owned by a thread that is no
> longer running.

So we leaked a buffer lock somewhere?

Cheers,

Dave.

Darrick J. Wong April 27, 2023, 12:53 a.m. UTC | #8

On Thu, Apr 27, 2023 at 10:11:24AM +1000, Dave Chinner wrote:
> On Wed, Apr 26, 2023 at 04:38:31PM -0700, Darrick J. Wong wrote:
> > On Thu, Apr 27, 2023 at 09:01:35AM +1000, Dave Chinner wrote:
> > > On Tue, Apr 25, 2023 at 08:20:52AM -0700, Darrick J. Wong wrote:
> > > > On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > On a filesystem with a non-zero stripe unit and a large sequential
> > > > > write, delayed allocation will set a minimum allocation length of
> > > > > the stripe unit. If allocation fails because there are no extents
> > > > > long enough for an aligned minlen allocation, it is supposed to
> > > > > fall back to unaligned allocation which allows single block extents
> > > > > to be allocated.
> > > > > 
> > > > > When the allocator code was rewritting in the 6.3 cycle, this
> > > > > fallback was broken - the old code used args->fsbno as the both the
> > > > > allocation target and the allocation result, the new code passes the
> > > > > target as a separate parameter. The conversion didn't handle the
> > > > > aligned->unaligned fallback path correctly - it reset args->fsbno to
> > > > > the target fsbno on failure which broke allocation failure detection
> > > > > in the high level code and so it never fell back to unaligned
> > > > > allocations.
> > > > > 
> > > > > This resulted in a loop in writeback trying to allocate an aligned
> > > > > block, getting a false positive success, trying to insert the result
> > > > > in the BMBT. This did nothing because the extent already was in the
> > > > > BMBT (merge results in an unchanged extent) and so it returned the
> > > > > prior extent to the conversion code as the current iomap.
> > > > > 
> > > > > Because the iomap returned didn't cover the offset we tried to map,
> > > > > xfs_convert_blocks() then retries the allocation, which fails in the
> > > > > same way and now we have a livelock.
> > > > > 
> > > > > Reported-by: Brian Foster <bfoster@redhat.com>
> > > > > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()")
> > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > Insofar as this has revealed a whole ton of *more* problems in mkfs,
> > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Thanks, I've added this to for-next and I'll include it in the pull
> > > req to Linus tomorrow because I don't want expose everyone using
> > > merge window kernels to this ENOSPC issue even for a short while.
> > > 
> > > > Specifically: if I set su=128k,sw=4, some tests will try to format a
> > > > 512M filesystem.  This results in an 8-AG filesystem with a log that
> > > > fills up almost but not all of an entire AG.  The AG then ends up with
> > > > an empty bnobt and an empty AGFL, and 25 missing blocks...
> > > 
> > > I used su=64k,sw=2 so I didn't see those specific issues. Mostly I
> > > see failures due to mkfs warnings like this:
> > > 
> > >     +Warning: AG size is a multiple of stripe width.  This can cause performance
> > >     +problems by aligning all AGs on the same disk.  To avoid this, run mkfs with
> > >     +an AG size that is one stripe unit smaller or larger, for example 129248.
> > 
> > Yeah, I noticed that one, and am testing a patch to quiet down mkfs a
> > little bit.
> > 
> > I also caught a bug in the AG formatting code where the bnobt gets
> > written out with zero records if the log happens to start beyond
> > m_ag_prealloc_size and end at EOAG.
> > 
> > I also noticed that the percpu inodegc workers occasionally run on the
> > wrong CPU, but only on arm.  Tonight I intend to test a fix for that...
> 
> Whacky. :/
> 
> > ...but I also have been tracking a fix for an issue where
> > xfs_inodegc_stop races with either the reclaim inodegc kicker or with an
> > already set-up delayed work timer, with the end result that
> > drain_workqueue sets WQ_DRAINING, someone (not the inodegc worker
> > itself) tries to queue_work the inodegc worker to the draining
> > workqueue, and we get a kernel bug message and the fs livelocks.
> 
> Yes, I've noticed that but not had time to fix it - disabling the
> inodegc whilst stuff is still in progress after checking inodegc is
> enabled is racy...
> 
> > I've also been trying to fix that problem that Ritesh mentioned months
> > ago where if we manage to mount the fs cleanly but there are unlinked
> > inodes, we'll eventually fall over when the incore unlinked list fails
> > to find those lingering unlinked inodes.
> 
> Right, I also haven't had time to get to that either. IIRC it fails
> to clear the bucket because we don't feed the error from the inodegc
> context back to the recovery code.
> 
> > I also added a su=128k,sw=4 config to the fstests fleet and am now
> > trying to fix all the fstests bugs that produce incorrect test failures.
> 
> The other thing I noticed is a couple of the FIEMAP tests fail
> because they find data blocks where they expect holes such as:
> 
> generic/225 21s ... - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad)
>     --- tests/generic/225.out   2022-12-21 15:53:25.479044361 +1100
>     +++ /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad      2023-04-26 04:24:31.426016818 +1000
>     @@ -1,3 +1,79 @@
>      QA output created by 225
>      fiemap run without preallocation, with sync
>     +ERROR: FIEMAP claimed there was data at a block which should be a hole, and FIBMAP confirmend that it is in fact a hole, so FIEMAP is wrong: 35
>     +ERROR: found an allocated extent where a hole should be: 35
>     +map is 'DHDDHHDDHDDHHHHDDDDDHHHHHHHDHDDDHHDHDHHHHHDDHDDHHDDHDHHDDDHHHHDDDDHDHHDDHHHDDDDHHDHDDDHHDHDDDHDHHHHHDHDHDHDHHDDHDHHHHDHHDDDDDDDH'
>     +logical: [      27..      27] phys:       67..      67 flags: 0x000 tot: 1
>     +logical: [      29..      31] phys:       69..      71 flags: 0x000 tot: 3
>     ...
>     (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/225.out /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad'  to see the entire diff)
> 
> I haven't looked into this yet, but nothing is reporting data
> corruptions so I suspect it's just the stripe aligned allocation
> leaving unwritten extents in places the test is expecting holes to
> exist...

That's the FIEMAP tester program not expecting that areas of the file
that it didn't write to can have unwritten extents mapped.  I'm testing
patches to fix all that tonight too.  If I can ever get these %#@%)#%!!!
orchestration scripts to work correctly.

> > > > ...oh and the new test vms that run this config failed to finish for
> > > > some reason.  Sigh.
> > > 
> > > Yeah, I've had xfs_repair hang in xfs/155 a couple of times. Killing
> > > the xfs_repair process allows everything to keep going. I suspect
> > > it's a prefetch race/deadlock...
> > 
> > <nod> I periodically catch xfs_repair deadlocked on an xfs_buf lock
> > where the pthread mutex says the lock is owned by a thread that is no
> > longer running.
> 
> So we leaked a buffer lock somewhere?

Yep.  It also shows up every now and then in the fstests that fuzz an
entire directory block and see if repair will fix it.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Dave Chinner April 27, 2023, 5:26 a.m. UTC | #9

On Wed, Apr 26, 2023 at 05:53:33PM -0700, Darrick J. Wong wrote:
> On Thu, Apr 27, 2023 at 10:11:24AM +1000, Dave Chinner wrote:
> > On Wed, Apr 26, 2023 at 04:38:31PM -0700, Darrick J. Wong wrote:
> > > I also added a su=128k,sw=4 config to the fstests fleet and am now
> > > trying to fix all the fstests bugs that produce incorrect test failures.
> > 
> > The other thing I noticed is a couple of the FIEMAP tests fail
> > because they find data blocks where they expect holes such as:
> > 
> > generic/225 21s ... - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad)
> >     --- tests/generic/225.out   2022-12-21 15:53:25.479044361 +1100
> >     +++ /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad      2023-04-26 04:24:31.426016818 +1000
> >     @@ -1,3 +1,79 @@
> >      QA output created by 225
> >      fiemap run without preallocation, with sync
> >     +ERROR: FIEMAP claimed there was data at a block which should be a hole, and FIBMAP confirmend that it is in fact a hole, so FIEMAP is wrong: 35
> >     +ERROR: found an allocated extent where a hole should be: 35
> >     +map is 'DHDDHHDDHDDHHHHDDDDDHHHHHHHDHDDDHHDHDHHHHHDDHDDHHDDHDHHDDDHHHHDDDDHDHHDDHHHDDDDHHDHDDDHHDHDDDHDHHHHHDHDHDHDHHDDHDHHHHDHHDDDDDDDH'
> >     +logical: [      27..      27] phys:       67..      67 flags: 0x000 tot: 1
> >     +logical: [      29..      31] phys:       69..      71 flags: 0x000 tot: 3
> >     ...
> >     (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/225.out /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad'  to see the entire diff)
> > 
> > I haven't looked into this yet, but nothing is reporting data
> > corruptions so I suspect it's just the stripe aligned allocation
> > leaving unwritten extents in places the test is expecting holes to
> > exist...
> 
> That's the FIEMAP tester program not expecting that areas of the file
> that it didn't write to can have unwritten extents mapped.  I'm testing
> patches to fix all that tonight too.  If I can ever get these %#@%)#%!!!
> orchestration scripts to work correctly.

OK.

FWIW, I've just found another bug in the stripe aligned allocation
at EOF that is triggered by the filestreams code hitting ENOSPC
conditions. xfs/170 seems to hit it fairly reliably - it's marking
args->pag as NULL and not resetting the caller pag correctly and the
high level filestreams failure code is expecting args->pag to be set
because it owns the reference...

I hope to have a fix for that one on the list this afternoon....

-Dave.

Darrick J. Wong April 27, 2023, 3:07 p.m. UTC | #10

On Thu, Apr 27, 2023 at 03:26:00PM +1000, Dave Chinner wrote:
> On Wed, Apr 26, 2023 at 05:53:33PM -0700, Darrick J. Wong wrote:
> > On Thu, Apr 27, 2023 at 10:11:24AM +1000, Dave Chinner wrote:
> > > On Wed, Apr 26, 2023 at 04:38:31PM -0700, Darrick J. Wong wrote:
> > > > I also added a su=128k,sw=4 config to the fstests fleet and am now
> > > > trying to fix all the fstests bugs that produce incorrect test failures.
> > > 
> > > The other thing I noticed is a couple of the FIEMAP tests fail
> > > because they find data blocks where they expect holes such as:
> > > 
> > > generic/225 21s ... - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad)
> > >     --- tests/generic/225.out   2022-12-21 15:53:25.479044361 +1100
> > >     +++ /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad      2023-04-26 04:24:31.426016818 +1000
> > >     @@ -1,3 +1,79 @@
> > >      QA output created by 225
> > >      fiemap run without preallocation, with sync
> > >     +ERROR: FIEMAP claimed there was data at a block which should be a hole, and FIBMAP confirmend that it is in fact a hole, so FIEMAP is wrong: 35
> > >     +ERROR: found an allocated extent where a hole should be: 35
> > >     +map is 'DHDDHHDDHDDHHHHDDDDDHHHHHHHDHDDDHHDHDHHHHHDDHDDHHDDHDHHDDDHHHHDDDDHDHHDDHHHDDDDHHDHDDDHHDHDDDHDHHHHHDHDHDHDHHDDHDHHHHDHHDDDDDDDH'
> > >     +logical: [      27..      27] phys:       67..      67 flags: 0x000 tot: 1
> > >     +logical: [      29..      31] phys:       69..      71 flags: 0x000 tot: 3
> > >     ...
> > >     (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/225.out /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad'  to see the entire diff)
> > > 
> > > I haven't looked into this yet, but nothing is reporting data
> > > corruptions so I suspect it's just the stripe aligned allocation
> > > leaving unwritten extents in places the test is expecting holes to
> > > exist...
> > 
> > That's the FIEMAP tester program not expecting that areas of the file
> > that it didn't write to can have unwritten extents mapped.  I'm testing
> > patches to fix all that tonight too.  If I can ever get these %#@%)#%!!!
> > orchestration scripts to work correctly.
> 
> OK.
> 
> FWIW, I've just found another bug in the stripe aligned allocation
> at EOF that is triggered by the filestreams code hitting ENOSPC
> conditions. xfs/170 seems to hit it fairly reliably - it's marking
> args->pag as NULL and not resetting the caller pag correctly and the
> high level filestreams failure code is expecting args->pag to be set
> because it owns the reference...
> 
> I hope to have a fix for that one on the list this afternoon....

Oh, yeah, I hit that one too.  I'll send out my fixes after the ext4
concall and we can sync up on that.

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

xfs: fix livelock in delayed allocation at ENOSPC

Commit Message

Comments

Patch