Message ID | 20230421222440.2722482-1-david@fromorbit.com (mailing list archive) |
---|---|
State | Accepted |
Headers | show |
Series | xfs: fix livelock in delayed allocation at ENOSPC | expand |
On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > On a filesystem with a non-zero stripe unit and a large sequential > write, delayed allocation will set a minimum allocation length of > the stripe unit. If allocation fails because there are no extents > long enough for an aligned minlen allocation, it is supposed to > fall back to unaligned allocation which allows single block extents > to be allocated. > > When the allocator code was rewritting in the 6.3 cycle, this > fallback was broken - the old code used args->fsbno as the both the > allocation target and the allocation result, the new code passes the > target as a separate parameter. The conversion didn't handle the > aligned->unaligned fallback path correctly - it reset args->fsbno to > the target fsbno on failure which broke allocation failure detection > in the high level code and so it never fell back to unaligned > allocations. > > This resulted in a loop in writeback trying to allocate an aligned > block, getting a false positive success, trying to insert the result > in the BMBT. This did nothing because the extent already was in the > BMBT (merge results in an unchanged extent) and so it returned the > prior extent to the conversion code as the current iomap. > > Because the iomap returned didn't cover the offset we tried to map, > xfs_convert_blocks() then retries the allocation, which fails in the > same way and now we have a livelock. > > Reported-by: Brian Foster <bfoster@redhat.com> > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()") > Signed-off-by: Dave Chinner <dchinner@redhat.com> Will give this one a spin through the test system over the weekend. In the meantime, can one of you come up with a reproducer? From the description, it doesn't sound like that should be too hard -- mount with no stripe unit set, fragment the free space, mount with a stripe unit set, then run the fs out of space? --D > --- > fs/xfs/libxfs/xfs_bmap.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c > index 1a4e446194dd..b512de0540d5 100644 > --- a/fs/xfs/libxfs/xfs_bmap.c > +++ b/fs/xfs/libxfs/xfs_bmap.c > @@ -3540,7 +3540,6 @@ xfs_bmap_btalloc_at_eof( > * original non-aligned state so the caller can proceed on allocation > * failure as if this function was never called. > */ > - args->fsbno = ap->blkno; > args->alignment = 1; > return 0; > } > -- > 2.39.2 >
On Fri, Apr 21, 2023 at 08:53:00PM -0700, Darrick J. Wong wrote: > On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > On a filesystem with a non-zero stripe unit and a large sequential > > write, delayed allocation will set a minimum allocation length of > > the stripe unit. If allocation fails because there are no extents > > long enough for an aligned minlen allocation, it is supposed to > > fall back to unaligned allocation which allows single block extents > > to be allocated. > > > > When the allocator code was rewritting in the 6.3 cycle, this > > fallback was broken - the old code used args->fsbno as the both the > > allocation target and the allocation result, the new code passes the > > target as a separate parameter. The conversion didn't handle the > > aligned->unaligned fallback path correctly - it reset args->fsbno to > > the target fsbno on failure which broke allocation failure detection > > in the high level code and so it never fell back to unaligned > > allocations. > > > > This resulted in a loop in writeback trying to allocate an aligned > > block, getting a false positive success, trying to insert the result > > in the BMBT. This did nothing because the extent already was in the > > BMBT (merge results in an unchanged extent) and so it returned the > > prior extent to the conversion code as the current iomap. > > > > Because the iomap returned didn't cover the offset we tried to map, > > xfs_convert_blocks() then retries the allocation, which fails in the > > same way and now we have a livelock. > > > > Reported-by: Brian Foster <bfoster@redhat.com> > > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()") > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > Will give this one a spin through the test system over the weekend. > > In the meantime, can one of you come up with a reproducer? From the > description, it doesn't sound like that should be too hard -- mount with > no stripe unit set, fragment the free space, mount with a stripe unit > set, then run the fs out of space? No need. # ./run_check --run-opts "-s xfs_align -g enospc" Running: MOUNT_OPTIONS= ./check -R xunit -b -s xfs_align -g enospc SECTION -- xfs_align FSTYP -- xfs (debug) PLATFORM -- Linux/x86_64 test2 6.3.0-rc6-dgc+ #1779 SMP PREEMPT_DYNAMIC Fri Apr 14 11:24:18 AEST 2023 MKFS_OPTIONS -- -f -m rmapbt=1 -dsu=128k,sw=2 /dev/vdb MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/vdb /mnt/scratch generic/015 1s ... Hangs immediately on the first ENOSPC test. IOWs, up to this point, no ENOSPC testing had been done on stripe aligned filesystems. A hole in the (ever expanding) test matrix we need to run... -Dave.
On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > On a filesystem with a non-zero stripe unit and a large sequential > write, delayed allocation will set a minimum allocation length of > the stripe unit. If allocation fails because there are no extents > long enough for an aligned minlen allocation, it is supposed to > fall back to unaligned allocation which allows single block extents > to be allocated. > > When the allocator code was rewritting in the 6.3 cycle, this > fallback was broken - the old code used args->fsbno as the both the > allocation target and the allocation result, the new code passes the > target as a separate parameter. The conversion didn't handle the > aligned->unaligned fallback path correctly - it reset args->fsbno to > the target fsbno on failure which broke allocation failure detection > in the high level code and so it never fell back to unaligned > allocations. > > This resulted in a loop in writeback trying to allocate an aligned > block, getting a false positive success, trying to insert the result > in the BMBT. This did nothing because the extent already was in the > BMBT (merge results in an unchanged extent) and so it returned the > prior extent to the conversion code as the current iomap. > > Because the iomap returned didn't cover the offset we tried to map, > xfs_convert_blocks() then retries the allocation, which fails in the > same way and now we have a livelock. > > Reported-by: Brian Foster <bfoster@redhat.com> > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()") > Signed-off-by: Dave Chinner <dchinner@redhat.com> > --- Problem solved, thanks. FWIW: Tested-by: Brian Foster <bfoster@redhat.com> > fs/xfs/libxfs/xfs_bmap.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c > index 1a4e446194dd..b512de0540d5 100644 > --- a/fs/xfs/libxfs/xfs_bmap.c > +++ b/fs/xfs/libxfs/xfs_bmap.c > @@ -3540,7 +3540,6 @@ xfs_bmap_btalloc_at_eof( > * original non-aligned state so the caller can proceed on allocation > * failure as if this function was never called. > */ > - args->fsbno = ap->blkno; > args->alignment = 1; > return 0; > } > -- > 2.39.2 >
On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@redhat.com> > > On a filesystem with a non-zero stripe unit and a large sequential > write, delayed allocation will set a minimum allocation length of > the stripe unit. If allocation fails because there are no extents > long enough for an aligned minlen allocation, it is supposed to > fall back to unaligned allocation which allows single block extents > to be allocated. > > When the allocator code was rewritting in the 6.3 cycle, this > fallback was broken - the old code used args->fsbno as the both the > allocation target and the allocation result, the new code passes the > target as a separate parameter. The conversion didn't handle the > aligned->unaligned fallback path correctly - it reset args->fsbno to > the target fsbno on failure which broke allocation failure detection > in the high level code and so it never fell back to unaligned > allocations. > > This resulted in a loop in writeback trying to allocate an aligned > block, getting a false positive success, trying to insert the result > in the BMBT. This did nothing because the extent already was in the > BMBT (merge results in an unchanged extent) and so it returned the > prior extent to the conversion code as the current iomap. > > Because the iomap returned didn't cover the offset we tried to map, > xfs_convert_blocks() then retries the allocation, which fails in the > same way and now we have a livelock. > > Reported-by: Brian Foster <bfoster@redhat.com> > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()") > Signed-off-by: Dave Chinner <dchinner@redhat.com> Insofar as this has revealed a whole ton of *more* problems in mkfs, Reviewed-by: Darrick J. Wong <djwong@kernel.org> Specifically: if I set su=128k,sw=4, some tests will try to format a 512M filesystem. This results in an 8-AG filesystem with a log that fills up almost but not all of an entire AG. The AG then ends up with an empty bnobt and an empty AGFL, and 25 missing blocks... ...oh and the new test vms that run this config failed to finish for some reason. Sigh. --D > --- > fs/xfs/libxfs/xfs_bmap.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c > index 1a4e446194dd..b512de0540d5 100644 > --- a/fs/xfs/libxfs/xfs_bmap.c > +++ b/fs/xfs/libxfs/xfs_bmap.c > @@ -3540,7 +3540,6 @@ xfs_bmap_btalloc_at_eof( > * original non-aligned state so the caller can proceed on allocation > * failure as if this function was never called. > */ > - args->fsbno = ap->blkno; > args->alignment = 1; > return 0; > } > -- > 2.39.2 >
On Tue, Apr 25, 2023 at 08:20:52AM -0700, Darrick J. Wong wrote: > On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote: > > From: Dave Chinner <dchinner@redhat.com> > > > > On a filesystem with a non-zero stripe unit and a large sequential > > write, delayed allocation will set a minimum allocation length of > > the stripe unit. If allocation fails because there are no extents > > long enough for an aligned minlen allocation, it is supposed to > > fall back to unaligned allocation which allows single block extents > > to be allocated. > > > > When the allocator code was rewritting in the 6.3 cycle, this > > fallback was broken - the old code used args->fsbno as the both the > > allocation target and the allocation result, the new code passes the > > target as a separate parameter. The conversion didn't handle the > > aligned->unaligned fallback path correctly - it reset args->fsbno to > > the target fsbno on failure which broke allocation failure detection > > in the high level code and so it never fell back to unaligned > > allocations. > > > > This resulted in a loop in writeback trying to allocate an aligned > > block, getting a false positive success, trying to insert the result > > in the BMBT. This did nothing because the extent already was in the > > BMBT (merge results in an unchanged extent) and so it returned the > > prior extent to the conversion code as the current iomap. > > > > Because the iomap returned didn't cover the offset we tried to map, > > xfs_convert_blocks() then retries the allocation, which fails in the > > same way and now we have a livelock. > > > > Reported-by: Brian Foster <bfoster@redhat.com> > > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()") > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > Insofar as this has revealed a whole ton of *more* problems in mkfs, > Reviewed-by: Darrick J. Wong <djwong@kernel.org> Thanks, I've added this to for-next and I'll include it in the pull req to Linus tomorrow because I don't want expose everyone using merge window kernels to this ENOSPC issue even for a short while. > Specifically: if I set su=128k,sw=4, some tests will try to format a > 512M filesystem. This results in an 8-AG filesystem with a log that > fills up almost but not all of an entire AG. The AG then ends up with > an empty bnobt and an empty AGFL, and 25 missing blocks... I used su=64k,sw=2 so I didn't see those specific issues. Mostly I see failures due to mkfs warnings like this: +Warning: AG size is a multiple of stripe width. This can cause performance +problems by aligning all AGs on the same disk. To avoid this, run mkfs with +an AG size that is one stripe unit smaller or larger, for example 129248. > ...oh and the new test vms that run this config failed to finish for > some reason. Sigh. Yeah, I've had xfs_repair hang in xfs/155 a couple of times. Killing the xfs_repair process allows everything to keep going. I suspect it's a prefetch race/deadlock... -Dave.
On Thu, Apr 27, 2023 at 09:01:35AM +1000, Dave Chinner wrote: > On Tue, Apr 25, 2023 at 08:20:52AM -0700, Darrick J. Wong wrote: > > On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote: > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > On a filesystem with a non-zero stripe unit and a large sequential > > > write, delayed allocation will set a minimum allocation length of > > > the stripe unit. If allocation fails because there are no extents > > > long enough for an aligned minlen allocation, it is supposed to > > > fall back to unaligned allocation which allows single block extents > > > to be allocated. > > > > > > When the allocator code was rewritting in the 6.3 cycle, this > > > fallback was broken - the old code used args->fsbno as the both the > > > allocation target and the allocation result, the new code passes the > > > target as a separate parameter. The conversion didn't handle the > > > aligned->unaligned fallback path correctly - it reset args->fsbno to > > > the target fsbno on failure which broke allocation failure detection > > > in the high level code and so it never fell back to unaligned > > > allocations. > > > > > > This resulted in a loop in writeback trying to allocate an aligned > > > block, getting a false positive success, trying to insert the result > > > in the BMBT. This did nothing because the extent already was in the > > > BMBT (merge results in an unchanged extent) and so it returned the > > > prior extent to the conversion code as the current iomap. > > > > > > Because the iomap returned didn't cover the offset we tried to map, > > > xfs_convert_blocks() then retries the allocation, which fails in the > > > same way and now we have a livelock. > > > > > > Reported-by: Brian Foster <bfoster@redhat.com> > > > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()") > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > Insofar as this has revealed a whole ton of *more* problems in mkfs, > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > Thanks, I've added this to for-next and I'll include it in the pull > req to Linus tomorrow because I don't want expose everyone using > merge window kernels to this ENOSPC issue even for a short while. > > > Specifically: if I set su=128k,sw=4, some tests will try to format a > > 512M filesystem. This results in an 8-AG filesystem with a log that > > fills up almost but not all of an entire AG. The AG then ends up with > > an empty bnobt and an empty AGFL, and 25 missing blocks... > > I used su=64k,sw=2 so I didn't see those specific issues. Mostly I > see failures due to mkfs warnings like this: > > +Warning: AG size is a multiple of stripe width. This can cause performance > +problems by aligning all AGs on the same disk. To avoid this, run mkfs with > +an AG size that is one stripe unit smaller or larger, for example 129248. Yeah, I noticed that one, and am testing a patch to quiet down mkfs a little bit. I also caught a bug in the AG formatting code where the bnobt gets written out with zero records if the log happens to start beyond m_ag_prealloc_size and end at EOAG. I also noticed that the percpu inodegc workers occasionally run on the wrong CPU, but only on arm. Tonight I intend to test a fix for that... ...but I also have been tracking a fix for an issue where xfs_inodegc_stop races with either the reclaim inodegc kicker or with an already set-up delayed work timer, with the end result that drain_workqueue sets WQ_DRAINING, someone (not the inodegc worker itself) tries to queue_work the inodegc worker to the draining workqueue, and we get a kernel bug message and the fs livelocks. I've also been trying to fix that problem that Ritesh mentioned months ago where if we manage to mount the fs cleanly but there are unlinked inodes, we'll eventually fall over when the incore unlinked list fails to find those lingering unlinked inodes. I also added a su=128k,sw=4 config to the fstests fleet and am now trying to fix all the fstests bugs that produce incorrect test failures. > > ...oh and the new test vms that run this config failed to finish for > > some reason. Sigh. > > Yeah, I've had xfs_repair hang in xfs/155 a couple of times. Killing > the xfs_repair process allows everything to keep going. I suspect > it's a prefetch race/deadlock... <nod> I periodically catch xfs_repair deadlocked on an xfs_buf lock where the pthread mutex says the lock is owned by a thread that is no longer running. --D > -Dave. > > -- > Dave Chinner > david@fromorbit.com
On Wed, Apr 26, 2023 at 04:38:31PM -0700, Darrick J. Wong wrote: > On Thu, Apr 27, 2023 at 09:01:35AM +1000, Dave Chinner wrote: > > On Tue, Apr 25, 2023 at 08:20:52AM -0700, Darrick J. Wong wrote: > > > On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote: > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > On a filesystem with a non-zero stripe unit and a large sequential > > > > write, delayed allocation will set a minimum allocation length of > > > > the stripe unit. If allocation fails because there are no extents > > > > long enough for an aligned minlen allocation, it is supposed to > > > > fall back to unaligned allocation which allows single block extents > > > > to be allocated. > > > > > > > > When the allocator code was rewritting in the 6.3 cycle, this > > > > fallback was broken - the old code used args->fsbno as the both the > > > > allocation target and the allocation result, the new code passes the > > > > target as a separate parameter. The conversion didn't handle the > > > > aligned->unaligned fallback path correctly - it reset args->fsbno to > > > > the target fsbno on failure which broke allocation failure detection > > > > in the high level code and so it never fell back to unaligned > > > > allocations. > > > > > > > > This resulted in a loop in writeback trying to allocate an aligned > > > > block, getting a false positive success, trying to insert the result > > > > in the BMBT. This did nothing because the extent already was in the > > > > BMBT (merge results in an unchanged extent) and so it returned the > > > > prior extent to the conversion code as the current iomap. > > > > > > > > Because the iomap returned didn't cover the offset we tried to map, > > > > xfs_convert_blocks() then retries the allocation, which fails in the > > > > same way and now we have a livelock. > > > > > > > > Reported-by: Brian Foster <bfoster@redhat.com> > > > > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()") > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > > > Insofar as this has revealed a whole ton of *more* problems in mkfs, > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > > Thanks, I've added this to for-next and I'll include it in the pull > > req to Linus tomorrow because I don't want expose everyone using > > merge window kernels to this ENOSPC issue even for a short while. > > > > > Specifically: if I set su=128k,sw=4, some tests will try to format a > > > 512M filesystem. This results in an 8-AG filesystem with a log that > > > fills up almost but not all of an entire AG. The AG then ends up with > > > an empty bnobt and an empty AGFL, and 25 missing blocks... > > > > I used su=64k,sw=2 so I didn't see those specific issues. Mostly I > > see failures due to mkfs warnings like this: > > > > +Warning: AG size is a multiple of stripe width. This can cause performance > > +problems by aligning all AGs on the same disk. To avoid this, run mkfs with > > +an AG size that is one stripe unit smaller or larger, for example 129248. > > Yeah, I noticed that one, and am testing a patch to quiet down mkfs a > little bit. > > I also caught a bug in the AG formatting code where the bnobt gets > written out with zero records if the log happens to start beyond > m_ag_prealloc_size and end at EOAG. > > I also noticed that the percpu inodegc workers occasionally run on the > wrong CPU, but only on arm. Tonight I intend to test a fix for that... Whacky. :/ > ...but I also have been tracking a fix for an issue where > xfs_inodegc_stop races with either the reclaim inodegc kicker or with an > already set-up delayed work timer, with the end result that > drain_workqueue sets WQ_DRAINING, someone (not the inodegc worker > itself) tries to queue_work the inodegc worker to the draining > workqueue, and we get a kernel bug message and the fs livelocks. Yes, I've noticed that but not had time to fix it - disabling the inodegc whilst stuff is still in progress after checking inodegc is enabled is racy... > I've also been trying to fix that problem that Ritesh mentioned months > ago where if we manage to mount the fs cleanly but there are unlinked > inodes, we'll eventually fall over when the incore unlinked list fails > to find those lingering unlinked inodes. Right, I also haven't had time to get to that either. IIRC it fails to clear the bucket because we don't feed the error from the inodegc context back to the recovery code. > I also added a su=128k,sw=4 config to the fstests fleet and am now > trying to fix all the fstests bugs that produce incorrect test failures. The other thing I noticed is a couple of the FIEMAP tests fail because they find data blocks where they expect holes such as: generic/225 21s ... - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad) --- tests/generic/225.out 2022-12-21 15:53:25.479044361 +1100 +++ /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad 2023-04-26 04:24:31.426016818 +1000 @@ -1,3 +1,79 @@ QA output created by 225 fiemap run without preallocation, with sync +ERROR: FIEMAP claimed there was data at a block which should be a hole, and FIBMAP confirmend that it is in fact a hole, so FIEMAP is wrong: 35 +ERROR: found an allocated extent where a hole should be: 35 +map is 'DHDDHHDDHDDHHHHDDDDDHHHHHHHDHDDDHHDHDHHHHHDDHDDHHDDHDHHDDDHHHHDDDDHDHHDDHHHDDDDHHDHDDDHHDHDDDHDHHHHHDHDHDHDHHDDHDHHHHDHHDDDDDDDH' +logical: [ 27.. 27] phys: 67.. 67 flags: 0x000 tot: 1 +logical: [ 29.. 31] phys: 69.. 71 flags: 0x000 tot: 3 ... (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/225.out /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad' to see the entire diff) I haven't looked into this yet, but nothing is reporting data corruptions so I suspect it's just the stripe aligned allocation leaving unwritten extents in places the test is expecting holes to exist... > > > ...oh and the new test vms that run this config failed to finish for > > > some reason. Sigh. > > > > Yeah, I've had xfs_repair hang in xfs/155 a couple of times. Killing > > the xfs_repair process allows everything to keep going. I suspect > > it's a prefetch race/deadlock... > > <nod> I periodically catch xfs_repair deadlocked on an xfs_buf lock > where the pthread mutex says the lock is owned by a thread that is no > longer running. So we leaked a buffer lock somewhere? Cheers, Dave.
On Thu, Apr 27, 2023 at 10:11:24AM +1000, Dave Chinner wrote: > On Wed, Apr 26, 2023 at 04:38:31PM -0700, Darrick J. Wong wrote: > > On Thu, Apr 27, 2023 at 09:01:35AM +1000, Dave Chinner wrote: > > > On Tue, Apr 25, 2023 at 08:20:52AM -0700, Darrick J. Wong wrote: > > > > On Sat, Apr 22, 2023 at 08:24:40AM +1000, Dave Chinner wrote: > > > > > From: Dave Chinner <dchinner@redhat.com> > > > > > > > > > > On a filesystem with a non-zero stripe unit and a large sequential > > > > > write, delayed allocation will set a minimum allocation length of > > > > > the stripe unit. If allocation fails because there are no extents > > > > > long enough for an aligned minlen allocation, it is supposed to > > > > > fall back to unaligned allocation which allows single block extents > > > > > to be allocated. > > > > > > > > > > When the allocator code was rewritting in the 6.3 cycle, this > > > > > fallback was broken - the old code used args->fsbno as the both the > > > > > allocation target and the allocation result, the new code passes the > > > > > target as a separate parameter. The conversion didn't handle the > > > > > aligned->unaligned fallback path correctly - it reset args->fsbno to > > > > > the target fsbno on failure which broke allocation failure detection > > > > > in the high level code and so it never fell back to unaligned > > > > > allocations. > > > > > > > > > > This resulted in a loop in writeback trying to allocate an aligned > > > > > block, getting a false positive success, trying to insert the result > > > > > in the BMBT. This did nothing because the extent already was in the > > > > > BMBT (merge results in an unchanged extent) and so it returned the > > > > > prior extent to the conversion code as the current iomap. > > > > > > > > > > Because the iomap returned didn't cover the offset we tried to map, > > > > > xfs_convert_blocks() then retries the allocation, which fails in the > > > > > same way and now we have a livelock. > > > > > > > > > > Reported-by: Brian Foster <bfoster@redhat.com> > > > > > Fixes: 85843327094f ("xfs: factor xfs_bmap_btalloc()") > > > > > Signed-off-by: Dave Chinner <dchinner@redhat.com> > > > > > > > > Insofar as this has revealed a whole ton of *more* problems in mkfs, > > > > Reviewed-by: Darrick J. Wong <djwong@kernel.org> > > > > > > Thanks, I've added this to for-next and I'll include it in the pull > > > req to Linus tomorrow because I don't want expose everyone using > > > merge window kernels to this ENOSPC issue even for a short while. > > > > > > > Specifically: if I set su=128k,sw=4, some tests will try to format a > > > > 512M filesystem. This results in an 8-AG filesystem with a log that > > > > fills up almost but not all of an entire AG. The AG then ends up with > > > > an empty bnobt and an empty AGFL, and 25 missing blocks... > > > > > > I used su=64k,sw=2 so I didn't see those specific issues. Mostly I > > > see failures due to mkfs warnings like this: > > > > > > +Warning: AG size is a multiple of stripe width. This can cause performance > > > +problems by aligning all AGs on the same disk. To avoid this, run mkfs with > > > +an AG size that is one stripe unit smaller or larger, for example 129248. > > > > Yeah, I noticed that one, and am testing a patch to quiet down mkfs a > > little bit. > > > > I also caught a bug in the AG formatting code where the bnobt gets > > written out with zero records if the log happens to start beyond > > m_ag_prealloc_size and end at EOAG. > > > > I also noticed that the percpu inodegc workers occasionally run on the > > wrong CPU, but only on arm. Tonight I intend to test a fix for that... > > Whacky. :/ > > > ...but I also have been tracking a fix for an issue where > > xfs_inodegc_stop races with either the reclaim inodegc kicker or with an > > already set-up delayed work timer, with the end result that > > drain_workqueue sets WQ_DRAINING, someone (not the inodegc worker > > itself) tries to queue_work the inodegc worker to the draining > > workqueue, and we get a kernel bug message and the fs livelocks. > > Yes, I've noticed that but not had time to fix it - disabling the > inodegc whilst stuff is still in progress after checking inodegc is > enabled is racy... > > > I've also been trying to fix that problem that Ritesh mentioned months > > ago where if we manage to mount the fs cleanly but there are unlinked > > inodes, we'll eventually fall over when the incore unlinked list fails > > to find those lingering unlinked inodes. > > Right, I also haven't had time to get to that either. IIRC it fails > to clear the bucket because we don't feed the error from the inodegc > context back to the recovery code. > > > I also added a su=128k,sw=4 config to the fstests fleet and am now > > trying to fix all the fstests bugs that produce incorrect test failures. > > The other thing I noticed is a couple of the FIEMAP tests fail > because they find data blocks where they expect holes such as: > > generic/225 21s ... - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad) > --- tests/generic/225.out 2022-12-21 15:53:25.479044361 +1100 > +++ /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad 2023-04-26 04:24:31.426016818 +1000 > @@ -1,3 +1,79 @@ > QA output created by 225 > fiemap run without preallocation, with sync > +ERROR: FIEMAP claimed there was data at a block which should be a hole, and FIBMAP confirmend that it is in fact a hole, so FIEMAP is wrong: 35 > +ERROR: found an allocated extent where a hole should be: 35 > +map is 'DHDDHHDDHDDHHHHDDDDDHHHHHHHDHDDDHHDHDHHHHHDDHDDHHDDHDHHDDDHHHHDDDDHDHHDDHHHDDDDHHDHDDDHHDHDDDHDHHHHHDHDHDHDHHDDHDHHHHDHHDDDDDDDH' > +logical: [ 27.. 27] phys: 67.. 67 flags: 0x000 tot: 1 > +logical: [ 29.. 31] phys: 69.. 71 flags: 0x000 tot: 3 > ... > (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/225.out /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad' to see the entire diff) > > I haven't looked into this yet, but nothing is reporting data > corruptions so I suspect it's just the stripe aligned allocation > leaving unwritten extents in places the test is expecting holes to > exist... That's the FIEMAP tester program not expecting that areas of the file that it didn't write to can have unwritten extents mapped. I'm testing patches to fix all that tonight too. If I can ever get these %#@%)#%!!! orchestration scripts to work correctly. > > > > ...oh and the new test vms that run this config failed to finish for > > > > some reason. Sigh. > > > > > > Yeah, I've had xfs_repair hang in xfs/155 a couple of times. Killing > > > the xfs_repair process allows everything to keep going. I suspect > > > it's a prefetch race/deadlock... > > > > <nod> I periodically catch xfs_repair deadlocked on an xfs_buf lock > > where the pthread mutex says the lock is owned by a thread that is no > > longer running. > > So we leaked a buffer lock somewhere? Yep. It also shows up every now and then in the fstests that fuzz an entire directory block and see if repair will fix it. --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Wed, Apr 26, 2023 at 05:53:33PM -0700, Darrick J. Wong wrote: > On Thu, Apr 27, 2023 at 10:11:24AM +1000, Dave Chinner wrote: > > On Wed, Apr 26, 2023 at 04:38:31PM -0700, Darrick J. Wong wrote: > > > I also added a su=128k,sw=4 config to the fstests fleet and am now > > > trying to fix all the fstests bugs that produce incorrect test failures. > > > > The other thing I noticed is a couple of the FIEMAP tests fail > > because they find data blocks where they expect holes such as: > > > > generic/225 21s ... - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad) > > --- tests/generic/225.out 2022-12-21 15:53:25.479044361 +1100 > > +++ /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad 2023-04-26 04:24:31.426016818 +1000 > > @@ -1,3 +1,79 @@ > > QA output created by 225 > > fiemap run without preallocation, with sync > > +ERROR: FIEMAP claimed there was data at a block which should be a hole, and FIBMAP confirmend that it is in fact a hole, so FIEMAP is wrong: 35 > > +ERROR: found an allocated extent where a hole should be: 35 > > +map is 'DHDDHHDDHDDHHHHDDDDDHHHHHHHDHDDDHHDHDHHHHHDDHDDHHDDHDHHDDDHHHHDDDDHDHHDDHHHDDDDHHDHDDDHHDHDDDHDHHHHHDHDHDHDHHDDHDHHHHDHHDDDDDDDH' > > +logical: [ 27.. 27] phys: 67.. 67 flags: 0x000 tot: 1 > > +logical: [ 29.. 31] phys: 69.. 71 flags: 0x000 tot: 3 > > ... > > (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/225.out /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad' to see the entire diff) > > > > I haven't looked into this yet, but nothing is reporting data > > corruptions so I suspect it's just the stripe aligned allocation > > leaving unwritten extents in places the test is expecting holes to > > exist... > > That's the FIEMAP tester program not expecting that areas of the file > that it didn't write to can have unwritten extents mapped. I'm testing > patches to fix all that tonight too. If I can ever get these %#@%)#%!!! > orchestration scripts to work correctly. OK. FWIW, I've just found another bug in the stripe aligned allocation at EOF that is triggered by the filestreams code hitting ENOSPC conditions. xfs/170 seems to hit it fairly reliably - it's marking args->pag as NULL and not resetting the caller pag correctly and the high level filestreams failure code is expecting args->pag to be set because it owns the reference... I hope to have a fix for that one on the list this afternoon.... -Dave.
On Thu, Apr 27, 2023 at 03:26:00PM +1000, Dave Chinner wrote: > On Wed, Apr 26, 2023 at 05:53:33PM -0700, Darrick J. Wong wrote: > > On Thu, Apr 27, 2023 at 10:11:24AM +1000, Dave Chinner wrote: > > > On Wed, Apr 26, 2023 at 04:38:31PM -0700, Darrick J. Wong wrote: > > > > I also added a su=128k,sw=4 config to the fstests fleet and am now > > > > trying to fix all the fstests bugs that produce incorrect test failures. > > > > > > The other thing I noticed is a couple of the FIEMAP tests fail > > > because they find data blocks where they expect holes such as: > > > > > > generic/225 21s ... - output mismatch (see /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad) > > > --- tests/generic/225.out 2022-12-21 15:53:25.479044361 +1100 > > > +++ /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad 2023-04-26 04:24:31.426016818 +1000 > > > @@ -1,3 +1,79 @@ > > > QA output created by 225 > > > fiemap run without preallocation, with sync > > > +ERROR: FIEMAP claimed there was data at a block which should be a hole, and FIBMAP confirmend that it is in fact a hole, so FIEMAP is wrong: 35 > > > +ERROR: found an allocated extent where a hole should be: 35 > > > +map is 'DHDDHHDDHDDHHHHDDDDDHHHHHHHDHDDDHHDHDHHHHHDDHDDHHDDHDHHDDDHHHHDDDDHDHHDDHHHDDDDHHDHDDDHHDHDDDHDHHHHHDHDHDHDHHDDHDHHHHDHHDDDDDDDH' > > > +logical: [ 27.. 27] phys: 67.. 67 flags: 0x000 tot: 1 > > > +logical: [ 29.. 31] phys: 69.. 71 flags: 0x000 tot: 3 > > > ... > > > (Run 'diff -u /home/dave/src/xfstests-dev/tests/generic/225.out /home/dave/src/xfstests-dev/results//xfs_align/generic/225.out.bad' to see the entire diff) > > > > > > I haven't looked into this yet, but nothing is reporting data > > > corruptions so I suspect it's just the stripe aligned allocation > > > leaving unwritten extents in places the test is expecting holes to > > > exist... > > > > That's the FIEMAP tester program not expecting that areas of the file > > that it didn't write to can have unwritten extents mapped. I'm testing > > patches to fix all that tonight too. If I can ever get these %#@%)#%!!! > > orchestration scripts to work correctly. > > OK. > > FWIW, I've just found another bug in the stripe aligned allocation > at EOF that is triggered by the filestreams code hitting ENOSPC > conditions. xfs/170 seems to hit it fairly reliably - it's marking > args->pag as NULL and not resetting the caller pag correctly and the > high level filestreams failure code is expecting args->pag to be set > because it owns the reference... > > I hope to have a fix for that one on the list this afternoon.... Oh, yeah, I hit that one too. I'll send out my fixes after the ext4 concall and we can sync up on that. --D > -Dave. > -- > Dave Chinner > david@fromorbit.com
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 1a4e446194dd..b512de0540d5 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -3540,7 +3540,6 @@ xfs_bmap_btalloc_at_eof( * original non-aligned state so the caller can proceed on allocation * failure as if this function was never called. */ - args->fsbno = ap->blkno; args->alignment = 1; return 0; }