Message ID | 20240416235108.3391394-1-mcgrof@kernel.org (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [kdevops] xfs: add xfs/242 as failing on xfs_reflink_2k | expand |
On Tue, Apr 16, 2024 at 04:51:08PM -0700, Luis Chamberlain wrote: > This test is rather simple, and somehow we managed to capture a > non-crash failure. The test was added to fstests via fstests commit > 0c95fadc35c8e450 ("expand 252 with more corner case tests") which > essentially does this: > > + $XFS_IO_PROG $xfs_io_opt -f -c "truncate $block_size" \ > + -c "pwrite 0 $block_size" $sync_cmd \ > + -c "$zero_cmd 128 128" \ > + -c "$map_cmd -v" $testfile | $filter_cmd > > The map_cmd in this case is: 'bmap -p'. So the test does: > > a) truncates data to the block size > b) sync > c) zero-fills the the blocksize Which subtest is this? I've seen periodic failures in xfs/242 that I can't reproduce either: --- /tmp/fstests/tests/xfs/242.out 2024-02-28 16:20:24.448887914 -0800 +++ /var/tmp/fstests/xfs/242.out.bad 2024-04-15 17:36:46.736000000 -0700 @@ -6,8 +6,7 @@ QA output created by 242 1aca77e2188f52a62674fe8a873bdaba 2. into allocated space 0: [0..127]: data -1: [128..383]: unwritten -2: [384..639]: data +1: [128..639]: unwritten 2f7a72b9ca9923b610514a11a45a80c9 3. into unwritten space 0: [0..639]: unwritten It's very strange to me that the block map changes but the md5 doesn't? The pwrite should have written 0xcd into the file and then the space between 64k and 192K got replaced with an unwritten extent. But everything between 192K and 320K should still be written data. --D > The xfs_io bmap displays the block mapping for the current open file. > Since our failed delta is: > > -0: [0..7]: data > +0: [0..7]: unwritten > > So it would seem we somehow managed to race to write, but it never > went anywhere. I can't reproduce yet, but figured I'd put this out > there to at least acknowledge its seen at least once. > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> > --- > > Super rare to trigger this but figured I'd check to see if others have seen > this fail before. This was on vanilla v6.8-rc2. I'm wondering a race is > possible with a guest using sparse files on the host, and the host somehow > incorrectly informing the guest the write is done. btrfs sparse files > were used on the host for the drives used by this guest for scratch drives. > > .../fstests/expunges/6.8.0-rc2/xfs/unassigned/xfs_reflink_2k.txt | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/workflows/fstests/expunges/6.8.0-rc2/xfs/unassigned/xfs_reflink_2k.txt b/workflows/fstests/expunges/6.8.0-rc2/xfs/unassigned/xfs_reflink_2k.txt > index f6ea47b0479f..8b4161f3700e 100644 > --- a/workflows/fstests/expunges/6.8.0-rc2/xfs/unassigned/xfs_reflink_2k.txt > +++ b/workflows/fstests/expunges/6.8.0-rc2/xfs/unassigned/xfs_reflink_2k.txt > @@ -19,6 +19,7 @@ xfs/075 > xfs/155 > xfs/168 > xfs/188 > +xfs/242 # F:1/2000 non-fatal failure cosmic ray? https://gist.github.com/mcgrof/6ef50311179a65221413a63c0cc8efd1 > xfs/270 > xfs/301 > xfs/502 # F:1/8 korg#218226 xfs assert fs/xfs/xfs_message.c:102 > -- > 2.43.0 > >
On Tue, Apr 16, 2024 at 05:37:55PM -0700, Darrick J. Wong wrote: > On Tue, Apr 16, 2024 at 04:51:08PM -0700, Luis Chamberlain wrote: > > This test is rather simple, and somehow we managed to capture a > > non-crash failure. The test was added to fstests via fstests commit > > 0c95fadc35c8e450 ("expand 252 with more corner case tests") which > > essentially does this: > > > > + $XFS_IO_PROG $xfs_io_opt -f -c "truncate $block_size" \ > > + -c "pwrite 0 $block_size" $sync_cmd \ > > + -c "$zero_cmd 128 128" \ > > + -c "$map_cmd -v" $testfile | $filter_cmd > > > > The map_cmd in this case is: 'bmap -p'. So the test does: > > > > a) truncates data to the block size > > b) sync > > c) zero-fills the the blocksize > > Which subtest is this? It's the: 17. data -> hole in single block file > I've seen periodic failures in xfs/242 that I can't reproduce either: Oh good to hear. > --- /tmp/fstests/tests/xfs/242.out 2024-02-28 16:20:24.448887914 -0800 > +++ /var/tmp/fstests/xfs/242.out.bad 2024-04-15 17:36:46.736000000 -0700 > @@ -6,8 +6,7 @@ QA output created by 242 > 1aca77e2188f52a62674fe8a873bdaba > 2. into allocated space > 0: [0..127]: data > -1: [128..383]: unwritten > -2: [384..639]: data > +1: [128..639]: unwritten > 2f7a72b9ca9923b610514a11a45a80c9 > 3. into unwritten space > 0: [0..639]: unwritten Oh curious, you hit #2 while I saw #17. VM or bare metal? If VM, real drive or sparse files? Mine guest files are virtio drives on files placed on the host on an XFS partition, the guest uses btrfs truncated files for the sparse files and loopback devices. ie: TEST_DEV=/dev/loop16 kdevops@base-xfs-reflink-2k ~ $ sudo losetup -a| grep loop16 /dev/loop16: [0038]:268 (/media/sparsefiles/sparse-disk16) kdevops@base-xfs-reflink-2k ~ $ df -h /media/sparsefiles/sparse-disk16 Filesystem Size Used Avail Use% Mounted on /dev/vdc 100G 408M 96G 1% /media/sparsefiles kdevops@base-xfs-reflink-2k ~ $ du -hs /media/sparsefiles/sparse-disk16 70M /media/sparsefiles/sparse-disk16 kdevops@base-xfs-reflink-2k ~ $ mount | grep -E "sparse|loop16" /dev/vdc on /media/sparsefiles type btrfs (rw,relatime,discard=async,noacl,space_cache=v2,subvolid=5,subvol=/) /dev/loop16 on /media/test type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota) > It's very strange to me that the block map changes but the md5 doesn't? Great point! How is that possible? > The pwrite should have written 0xcd into the file and then the space > between 64k and 192K got replaced with an unwritten extent. But > everything between 192K and 320K should still be written data. Luis
On Tue, Apr 16, 2024 at 04:51:08PM -0700, Luis Chamberlain wrote: > This test is rather simple, and somehow we managed to capture a > non-crash failure. The test was added to fstests via fstests commit > 0c95fadc35c8e450 ("expand 252 with more corner case tests") which > essentially does this: > > + $XFS_IO_PROG $xfs_io_opt -f -c "truncate $block_size" \ > + -c "pwrite 0 $block_size" $sync_cmd \ > + -c "$zero_cmd 128 128" \ > + -c "$map_cmd -v" $testfile | $filter_cmd > > The map_cmd in this case is: 'bmap -p'. So the test does: > > a) truncates data to the block size > b) sync > c) zero-fills the the blocksize > > The xfs_io bmap displays the block mapping for the current open file. > Since our failed delta is: > > -0: [0..7]: data > +0: [0..7]: unwritten That's most likely a _filter_bmap() issue, not a kernel code bug. i.e. 'bmap -vp' output looks like: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..231]: 2076367680..2076367911 18 (6251328..6251559) 232 000000 and _filter_bmap has two separate regex matches against different fields that both trigger "unwritten" output. The first check is against field 5 which is actually the AG-OFFSET in this output, not field 7 which is the FLAGS field. Hence if the ag offset matches '/0[01][01][01][01]/' the filter will emit 'unwritten' regardless of what the flags say it actually is. -Dave.
diff --git a/workflows/fstests/expunges/6.8.0-rc2/xfs/unassigned/xfs_reflink_2k.txt b/workflows/fstests/expunges/6.8.0-rc2/xfs/unassigned/xfs_reflink_2k.txt index f6ea47b0479f..8b4161f3700e 100644 --- a/workflows/fstests/expunges/6.8.0-rc2/xfs/unassigned/xfs_reflink_2k.txt +++ b/workflows/fstests/expunges/6.8.0-rc2/xfs/unassigned/xfs_reflink_2k.txt @@ -19,6 +19,7 @@ xfs/075 xfs/155 xfs/168 xfs/188 +xfs/242 # F:1/2000 non-fatal failure cosmic ray? https://gist.github.com/mcgrof/6ef50311179a65221413a63c0cc8efd1 xfs/270 xfs/301 xfs/502 # F:1/8 korg#218226 xfs assert fs/xfs/xfs_message.c:102
This test is rather simple, and somehow we managed to capture a non-crash failure. The test was added to fstests via fstests commit 0c95fadc35c8e450 ("expand 252 with more corner case tests") which essentially does this: + $XFS_IO_PROG $xfs_io_opt -f -c "truncate $block_size" \ + -c "pwrite 0 $block_size" $sync_cmd \ + -c "$zero_cmd 128 128" \ + -c "$map_cmd -v" $testfile | $filter_cmd The map_cmd in this case is: 'bmap -p'. So the test does: a) truncates data to the block size b) sync c) zero-fills the the blocksize The xfs_io bmap displays the block mapping for the current open file. Since our failed delta is: -0: [0..7]: data +0: [0..7]: unwritten So it would seem we somehow managed to race to write, but it never went anywhere. I can't reproduce yet, but figured I'd put this out there to at least acknowledge its seen at least once. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> --- Super rare to trigger this but figured I'd check to see if others have seen this fail before. This was on vanilla v6.8-rc2. I'm wondering a race is possible with a guest using sparse files on the host, and the host somehow incorrectly informing the guest the write is done. btrfs sparse files were used on the host for the drives used by this guest for scratch drives. .../fstests/expunges/6.8.0-rc2/xfs/unassigned/xfs_reflink_2k.txt | 1 + 1 file changed, 1 insertion(+)