Message ID | 169687551395.3948976.8425812597156927952.stgit@frogsfrogsfrogs (mailing list archive)
---|---
State | New, archived
Series | fstests: random fixes for v2023.10.08
On Mon, Oct 09, 2023 at 11:18:33AM -0700, Darrick J. Wong wrote:
> The storage advertises SCSI UNMAP support, but it is of the variety
> where the UNMAP command returns immediately but takes its time to unmap
> in the background. Subsequent rereads are allowed to return stale
> contents, per DISCARD semantics.
>
> When the fstests cloud is not busy, the old contents disappear in a few
> seconds. However, at peak utilization, there are ~75 VMs running, and
> the storage backend can take several minutes to commit these background
> requests.

Umm, that is not valid behavior for SCSI UNMAP or any other command that
Linux discard maps to. All of them can do one of two things on a
per-block basis:

- return the unmap pattern (usually but not always 0) for any read
  following the unmap/trim/discard
- always return the previous pattern until it is overwritten or
  discarded again

Changing the pattern some time after unmap is a grave bug, and we need
to blacklist the device.
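[For anyone chasing this on their own hardware: a quick way to see what the kernel believes the device advertises for discard is `lsblk --discard` or the request-queue attributes in sysfs. This is only a rough sketch; "$dev" is a placeholder for a whole-disk node, and sysfs can only report what the device claims, not whether reads after UNMAP are actually stable.]

#!/bin/bash
# Rough sketch: dump the kernel's view of a device's discard support.
# "$dev" is a made-up example path, not anything from this thread.
dev=${1:-/dev/sdb}
name=$(basename "$(readlink -f "$dev")")

# Summary view: DISC-GRAN and DISC-MAX of 0 mean no discard support.
lsblk --discard "$dev"

# The same values, straight from the request queue attributes.
for attr in discard_granularity discard_max_bytes discard_zeroes_data; do
        f=/sys/block/$name/queue/$attr
        [ -r "$f" ] && echo "$attr = $(cat "$f")"
done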
On Tue, Oct 10, 2023 at 12:01:33AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 09, 2023 at 11:18:33AM -0700, Darrick J. Wong wrote:
> > The storage advertises SCSI UNMAP support, but it is of the variety
> > where the UNMAP command returns immediately but takes its time to unmap
> > in the background. Subsequent rereads are allowed to return stale
> > contents, per DISCARD semantics.
> >
> > When the fstests cloud is not busy, the old contents disappear in a few
> > seconds. However, at peak utilization, there are ~75 VMs running, and
> > the storage backend can take several minutes to commit these background
> > requests.
>
> Umm, that is not valid behavior for SCSI UNMAP or any other command
> that Linux discard maps to. All of them can do one of two things on a
> per-block basis:
>
> - return the unmap pattern (usually but not always 0) for any read
>   following the unmap/trim/discard
> - always return the previous pattern until it is overwritten or
>   discarded again
>
> Changing the pattern some time after unmap is a grave bug, and we need
> to blacklist the device.

Ok, I'll go pester them about fixing that, if they haven't already.
Apparently discard support is somewhat new.

I'm pretty sure I've seen some NVMe SSDs where you can issue device-wide
DISCARDs and slowly watch the namespace utilization go down over tens of
minutes; and reads will only eventually start returning zeroes. (Note
that *writes* during the slow-discard period are persisted correctly.)

However, that's orthogonal to this patch -- if the device doesn't
support discard, _scratch_mkfs won't zero the entire disk to remove old
dead superblocks that might have been written by previous tests. After
we shatter the primary super, the xfs_repair scanning code can still
trip over those old supers and break the golden output.

--D
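[To make the "ghost superblock" failure mode concrete, here is a rough, destructive sketch of how secondaries from an earlier format can survive on a device with no discard support and then show up in the secondary-superblock scan. The device path and AG counts are made-up illustration values, this is not part of the patch, and whether the scan actually hits a ghost depends on the device size and geometry.]

#!/bin/bash
# Destructive sketch: leave stale secondary superblocks behind, then
# break the primary and let xfs_repair scan for secondaries.
dev=/dev/vdb    # throwaway example device

# Old format: many small AGs scatter many secondary superblocks.
mkfs.xfs -f -d agcount=32 "$dev"

# New format with fewer AGs. Without a discard or zeroing pass, most of
# the old secondaries are still sitting in now-unclaimed parts of the disk.
mkfs.xfs -f -d agcount=4 "$dev"

# Destroy the primary superblock, as xfs/178 does.
dd if=/dev/zero of="$dev" bs=512 count=1

# The scan may now trip over leftovers from the first format before it
# settles on the real AG 1 secondary.
xfs_repair "$dev"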
On Wed, Oct 11, 2023 at 12:10:25PM -0700, Darrick J. Wong wrote:
> I'm pretty sure I've seen some NVMe SSDs where you can issue device-wide
> DISCARDs and slowly watch the namespace utilization go down over tens of
> minutes; and reads will only eventually start returning zeroes.

Well, the second part is broken. The first part is fine, and I've briefly
consulted with a firmware team implementing such a feature. It just needs
to make sure to return zeroes right after the return of the discard,
even if the blocks aren't erased yet, including after a power failure.
(Anyone who knows the XFS truncate / hole punch code will have a vague
idea of how that could work.)

> However, that's orthogonal to this patch -- if the device doesn't
> support discard, _scratch_mkfs won't zero the entire disk to remove old
> dead superblocks that might have been written by previous tests. After
> we shatter the primary super, the xfs_repair scanning code can still
> trip over those old supers and break the golden output.

True. I have to admit I stopped reading the patch after the unmap
description. I'll take another look.
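[A crude way to test for the invalid behavior described here -- the contents of a discarded range changing some time after the command has returned -- is to discard a range, read it back immediately, and read it again later: the two reads must match, whether the device chose to return zeroes or the old data. A destructive sketch with made-up device path, range, and wait time:]

#!/bin/bash
# Destructive sketch: verify that reads of a discarded range are stable.
dev=/dev/vdb            # throwaway example device
len=$((1024 * 1024))    # arbitrary 1 MiB range at offset 0

blkdiscard --force --offset 0 --length "$len" "$dev"

# Either all-zeroes or the pre-discard data is acceptable here; what is
# not acceptable is the answer changing between these two reads.
dd if="$dev" bs="$len" count=1 iflag=direct 2>/dev/null >/tmp/read1
sleep 120
dd if="$dev" bs="$len" count=1 iflag=direct 2>/dev/null >/tmp/read2

cmp -s /tmp/read1 /tmp/read2 && echo "stable" ||
        echo "contents changed after the discard returned"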
On Wed, Oct 11, 2023 at 09:26:51PM -0700, Christoph Hellwig wrote:
> On Wed, Oct 11, 2023 at 12:10:25PM -0700, Darrick J. Wong wrote:
> > I'm pretty sure I've seen some NVMe SSDs where you can issue device-wide
> > DISCARDs and slowly watch the namespace utilization go down over tens of
> > minutes; and reads will only eventually start returning zeroes.
>
> Well, the second part is broken. The first part is fine, and I've briefly
> consulted with a firmware team implementing such a feature. It just needs
> to make sure to return zeroes right after the return of the discard,
> even if the blocks aren't erased yet, including after a power failure.
> (Anyone who knows the XFS truncate / hole punch code will have a vague
> idea of how that could work.)
>
> > However, that's orthogonal to this patch -- if the device doesn't
> > support discard, _scratch_mkfs won't zero the entire disk to remove old
> > dead superblocks that might have been written by previous tests. After
> > we shatter the primary super, the xfs_repair scanning code can still
> > trip over those old supers and break the golden output.
>
> True. I have to admit I stopped reading the patch after the unmap
> description. I'll take another look.

<nod> I think I'll update the next version of this patch to substitute
the paragraph that I wrote above for all the misleading ramblings about
DISCARD. Today's revisit of that cloud block device doesn't show the
weird stale-reads-that-later-disappear behavior, so it's entirely
possible that they've fixed it already.

--D
Unlike the commit log, the changes look good to me.
From: Darrick J. Wong <djwong@kernel.org>

When I added an fstests config for "RAID" striping (aka
MKFS_OPTIONS='-d su=128k,sw=4'), I suddenly started seeing this test
fail sporadically with:

--- /tmp/fstests/tests/xfs/178.out	2023-07-11 12:18:21.714970364 -0700
+++ /var/tmp/fstests/xfs/178.out.bad	2023-07-25 22:05:39.756000000 -0700
@@ -10,6 +10,20 @@ bad primary superblock - bad magic numbe
 attempting to find secondary superblock...
 found candidate secondary superblock...
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
+error reading superblock 1 -- seek to offset 584115421184 failed
+unable to verify superblock, continuing...
+found candidate secondary superblock...
 verified secondary superblock...
 writing modified primary superblock
 sb root inode INO inconsistent with calculated value INO

Eventually I tracked this down to a mis-interaction between the test,
xfs_repair, and the storage device. The storage advertises SCSI UNMAP
support, but it is of the variety where the UNMAP command returns
immediately but takes its time to unmap in the background. Subsequent
rereads are allowed to return stale contents, per DISCARD semantics.

When the fstests cloud is not busy, the old contents disappear in a few
seconds. However, at peak utilization, there are ~75 VMs running, and
the storage backend can take several minutes to commit these background
requests.

When we zero the primary super and start xfs_repair on SCRATCH_DEV, it
will walk the device looking for secondary supers. Most of the time it
finds the actual AG 1 secondary super, but sometimes it finds ghosts
from previous formats. When that happens, xfs_repair will talk quite a
bit about those failed secondaries, even if it eventually finds an
acceptable secondary sb and completes the repair.

Filter out the messages about secondary supers.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 tests/xfs/178     |    9 ++++++++-
 tests/xfs/178.out |    2 --
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/tests/xfs/178 b/tests/xfs/178
index a65197cde3..fee1e92bf3 100755
--- a/tests/xfs/178
+++ b/tests/xfs/178
@@ -10,13 +10,20 @@
 . ./common/preamble
 _begin_fstest mkfs other auto
 
+filter_repair() {
+        _filter_repair | sed \
+                -e '/unable to verify superblock, continuing/d' \
+                -e '/found candidate secondary superblock/d' \
+                -e '/error reading superblock.*-- seek to offset/d'
+}
+
 # dd the 1st sector then repair
 _dd_repair_check()
 {
         #dd first sector
         dd if=/dev/zero of=$1 bs=$2 count=1 2>&1 | _filter_dd
         #xfs_repair
-        _scratch_xfs_repair 2>&1 | _filter_repair
+        _scratch_xfs_repair 2>&1 | filter_repair
         #check repair
         if _check_scratch_fs; then
                 echo "repair passed"
diff --git a/tests/xfs/178.out b/tests/xfs/178.out
index 0bebe553eb..711e90cc26 100644
--- a/tests/xfs/178.out
+++ b/tests/xfs/178.out
@@ -9,7 +9,6 @@ Phase 1 - find and verify superblock...
 bad primary superblock - bad magic number !!!
 
 attempting to find secondary superblock...
-found candidate secondary superblock...
 verified secondary superblock...
 writing modified primary superblock
 sb root inode INO inconsistent with calculated value INO
@@ -45,7 +44,6 @@ Phase 1 - find and verify superblock...
 bad primary superblock - bad magic number !!!
 
 attempting to find secondary superblock...
-found candidate secondary superblock...
 verified secondary superblock...
 writing modified primary superblock
 sb root inode INO inconsistent with calculated value INO
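[The new filter_repair helper pipes _filter_repair's output through three sed delete expressions. As a quick illustration (not part of the patch), feeding a hand-written sample of the noisy lines above through just those sed expressions shows what survives; the real helper also runs _filter_repair first.]

#!/bin/bash
# Illustration only: run a hand-written sample of the noisy xfs_repair
# output through the same sed expressions the patch adds.
sed \
        -e '/unable to verify superblock, continuing/d' \
        -e '/found candidate secondary superblock/d' \
        -e '/error reading superblock.*-- seek to offset/d' <<'EOF'
attempting to find secondary superblock...
found candidate secondary superblock...
error reading superblock 1 -- seek to offset 584115421184 failed
unable to verify superblock, continuing...
found candidate secondary superblock...
verified secondary superblock...
writing modified primary superblock
EOF
# Prints only:
#   attempting to find secondary superblock...
#   verified secondary superblock...
#   writing modified primary superblock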