diff mbox series

[RFC] generic/730: ensure EIO after device delete

Message ID 6165b0018426e58e7427a26eebea3d63e66fed15.1710373423.git.boris@bur.io (mailing list archive)
State New, archived
Headers show
Series [RFC] generic/730: ensure EIO after device delete | expand

Commit Message

Boris Burkov March 13, 2024, 11:47 p.m. UTC
This test removes a SCSI debug device out from under a mounted
filesystem with a (probably) dirty file. This assumes that page cache
cannot save us from EIO, for a reason that I can't quite explain. In
fact, this test fails for exactly that reason, at least on btrfs.

The original patches:

      https://lore.kernel.org/fstests/20230807112100.GB15405@lst.de/

refer to this passing on xfs and not btrfs, so I suspect I am missing
something. With that said, on my machine this actually fails on xfs with
and without my patch, so this is clearly not enough.

High level, I am trying to understand what is really the expected
behavior from a filesystem under this condition and what this test is
getting at. Of btrfs, ext4, and xfs, only ext4 passes it, while btrfs
does pass with this additional syncing/cache dropping to nudge it to an
error.

Signed-off-by: Boris Burkov <boris@bur.io>
---
 tests/generic/730 | 3 +++
 1 file changed, 3 insertions(+)

Comments

Darrick J. Wong March 15, 2024, 2:56 a.m. UTC | #1
On Wed, Mar 13, 2024 at 04:47:47PM -0700, Boris Burkov wrote:
> This test removes a SCSI debug device out from under a mounted
> filesystem with a (probably) dirty file. This assumes that page cache
> cannot save us from EIO, for a reason that I can't quite explain. In
> fact, this test fails for exactly that reason, at least on btrfs.
> 
> The original patches:
> 
>       https://lore.kernel.org/fstests/20230807112100.GB15405@lst.de/
> 
> refer to this passing on xfs and not btrfs, so I suspect I am missing
> something. With that said, on my machine this actually fails on xfs with
> and without my patch, so this is clearly not enough.
> 
> High level, I am trying to understand what is really the expected
> behavior from a filesystem under this condition and what this test is
> getting at. Of btrfs, ext4, and xfs, only ext4 passes it, while btrfs
> does pass with this additional syncing/cache dropping to nudge it to an
> error.

Does btrfs prefetch pagecache data as soon as a file opens?  Or I guess
it could be that xfs trips an IO error and shuts down, and
xfs_file_read_iter will return EIO after that happens.

--D

> Signed-off-by: Boris Burkov <boris@bur.io>
> ---
>  tests/generic/730 | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/tests/generic/730 b/tests/generic/730
> index 11308cdaa..ca5037c57 100755
> --- a/tests/generic/730
> +++ b/tests/generic/730
> @@ -47,6 +47,9 @@ exec 3< $SCSI_DEBUG_MNT/testfile
>  # delete the scsi debug device while it still has dirty data
>  echo 1 > /sys/block/$(_short_dev $SCSI_DEBUG_DEV)/device/delete
>  
> +sync
> +echo 3 > /proc/sys/vm/drop_caches
> +
>  # try to read from the file, which should give us -EIO
>  cat <&3 > /dev/null
>  
> -- 
> 2.43.0
> 
>
Boris Burkov March 15, 2024, 11:30 p.m. UTC | #2
On Thu, Mar 14, 2024 at 07:56:28PM -0700, Darrick J. Wong wrote:
> On Wed, Mar 13, 2024 at 04:47:47PM -0700, Boris Burkov wrote:
> > This test removes a SCSI debug device out from under a mounted
> > filesystem with a (probably) dirty file. This assumes that page cache
> > cannot save us from EIO, for a reason that I can't quite explain. In
> > fact, this test fails for exactly that reason, at least on btrfs.
> > 
> > The original patches:
> > 
> >       https://lore.kernel.org/fstests/20230807112100.GB15405@lst.de/
> > 
> > refer to this passing on xfs and not btrfs, so I suspect I am missing
> > something. With that said, on my machine this actually fails on xfs with
> > and without my patch, so this is clearly not enough.
> > 
> > High level, I am trying to understand what is really the expected
> > behavior from a filesystem under this condition and what this test is
> > getting at. Of btrfs, ext4, and xfs, only ext4 passes it, while btrfs
> > does pass with this additional syncing/cache dropping to nudge it to an
> > error.
> 
> Does btrfs prefetch pagecache data as soon as a file opens?  Or I guess
> it could be that xfs trips an IO error and shuts down, and
> xfs_file_read_iter will return EIO after that happens.
> 
> --D
> 

I might be wildly misunderstanding, since I'm very far from an expert on
the page cache, but didn't we just do a buffered write to the file? So it
would make sense for it to be in cache?

At any rate, what I observed on my system was that xfs does not pass
this test. Curious if that is only on my system or just the current
state of the test.

Summary of my test matrix:
ext4: passes with and without this patch
btrfs: fails without this patch, passes with it
xfs: fails with and without this patch

For that reason, I don't think this a good patch, I just mostly want to
figure out where to go with this test!

Thanks,
Boris

> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  tests/generic/730 | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/tests/generic/730 b/tests/generic/730
> > index 11308cdaa..ca5037c57 100755
> > --- a/tests/generic/730
> > +++ b/tests/generic/730
> > @@ -47,6 +47,9 @@ exec 3< $SCSI_DEBUG_MNT/testfile
> >  # delete the scsi debug device while it still has dirty data
> >  echo 1 > /sys/block/$(_short_dev $SCSI_DEBUG_DEV)/device/delete
> >  
> > +sync
> > +echo 3 > /proc/sys/vm/drop_caches
> > +
> >  # try to read from the file, which should give us -EIO
> >  cat <&3 > /dev/null
> >  
> > -- 
> > 2.43.0
> > 
> >
Filipe Manana March 16, 2024, 1:08 p.m. UTC | #3
On Fri, Mar 15, 2024 at 11:30 PM Boris Burkov <boris@bur.io> wrote:
>
> On Thu, Mar 14, 2024 at 07:56:28PM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 13, 2024 at 04:47:47PM -0700, Boris Burkov wrote:
> > > This test removes a SCSI debug device out from under a mounted
> > > filesystem with a (probably) dirty file. This assumes that page cache
> > > cannot save us from EIO, for a reason that I can't quite explain. In
> > > fact, this test fails for exactly that reason, at least on btrfs.
> > >
> > > The original patches:
> > >
> > >       https://lore.kernel.org/fstests/20230807112100.GB15405@lst.de/
> > >
> > > refer to this passing on xfs and not btrfs, so I suspect I am missing
> > > something. With that said, on my machine this actually fails on xfs with
> > > and without my patch, so this is clearly not enough.
> > >
> > > High level, I am trying to understand what is really the expected
> > > behavior from a filesystem under this condition and what this test is
> > > getting at. Of btrfs, ext4, and xfs, only ext4 passes it, while btrfs
> > > does pass with this additional syncing/cache dropping to nudge it to an
> > > error.
> >
> > Does btrfs prefetch pagecache data as soon as a file opens?  Or I guess
> > it could be that xfs trips an IO error and shuts down, and
> > xfs_file_read_iter will return EIO after that happens.
> >
> > --D
> >
>
> I might be wildly misunderstanding, since I'm very far from an expert on
> the page cache, but didn't we just do a buffered write to the file? So it
> would make sense for it to be in cache?
>
> At any rate, what I observed on my system was that xfs does not pass
> this test. Curious if that is only on my system or just the current
> state of the test.

Instead of xfs, do you mean btrfs?

For me xfs and ext4 always pass this test, i.e. the cat command fails
with -EIO, both with and without your patch.

On the other hand, btrfs and f2fs always fail without your patch, and
pass with your patch applied.

From a btrfs point of view, it makes sense for the cat command to
succeed, since all the file data is in the page cache.

I don't know why the cat command fails (-EIO) with xfs and ext4, and I
haven't looked at any of their source code (my knowledge of those
filesystem's internals is too little anyway).

An alternative to your patch is just to make the write with direct IO
by passing -d to $XFS_IO_PROG. It makes the test succeed on these 4
filesystems.


>
> Summary of my test matrix:
> ext4: passes with and without this patch
> btrfs: fails without this patch, passes with it
> xfs: fails with and without this patch
>
> For that reason, I don't think this a good patch, I just mostly want to
> figure out where to go with this test!
>
> Thanks,
> Boris
>
> > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > ---
> > >  tests/generic/730 | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/tests/generic/730 b/tests/generic/730
> > > index 11308cdaa..ca5037c57 100755
> > > --- a/tests/generic/730
> > > +++ b/tests/generic/730
> > > @@ -47,6 +47,9 @@ exec 3< $SCSI_DEBUG_MNT/testfile
> > >  # delete the scsi debug device while it still has dirty data
> > >  echo 1 > /sys/block/$(_short_dev $SCSI_DEBUG_DEV)/device/delete
> > >
> > > +sync
> > > +echo 3 > /proc/sys/vm/drop_caches
> > > +
> > >  # try to read from the file, which should give us -EIO
> > >  cat <&3 > /dev/null
> > >
> > > --
> > > 2.43.0
> > >
> > >
>
Boris Burkov March 16, 2024, 2:40 p.m. UTC | #4
On Sat, Mar 16, 2024 at 01:08:57PM +0000, Filipe Manana wrote:
> On Fri, Mar 15, 2024 at 11:30 PM Boris Burkov <boris@bur.io> wrote:
> >
> > On Thu, Mar 14, 2024 at 07:56:28PM -0700, Darrick J. Wong wrote:
> > > On Wed, Mar 13, 2024 at 04:47:47PM -0700, Boris Burkov wrote:
> > > > This test removes a SCSI debug device out from under a mounted
> > > > filesystem with a (probably) dirty file. This assumes that page cache
> > > > cannot save us from EIO, for a reason that I can't quite explain. In
> > > > fact, this test fails for exactly that reason, at least on btrfs.
> > > >
> > > > The original patches:
> > > >
> > > >       https://lore.kernel.org/fstests/20230807112100.GB15405@lst.de/
> > > >
> > > > refer to this passing on xfs and not btrfs, so I suspect I am missing
> > > > something. With that said, on my machine this actually fails on xfs with
> > > > and without my patch, so this is clearly not enough.
> > > >
> > > > High level, I am trying to understand what is really the expected
> > > > behavior from a filesystem under this condition and what this test is
> > > > getting at. Of btrfs, ext4, and xfs, only ext4 passes it, while btrfs
> > > > does pass with this additional syncing/cache dropping to nudge it to an
> > > > error.
> > >
> > > Does btrfs prefetch pagecache data as soon as a file opens?  Or I guess
> > > it could be that xfs trips an IO error and shuts down, and
> > > xfs_file_read_iter will return EIO after that happens.
> > >
> > > --D
> > >
> >
> > I might be wildly misunderstanding, since I'm very far from an expert on
> > the page cache, but didn't we just do a buffered write to the file? So it
> > would make sense for it to be in cache?
> >
> > At any rate, what I observed on my system was that xfs does not pass
> > this test. Curious if that is only on my system or just the current
> > state of the test.
> 
> Instead of xfs, do you mean btrfs?
> 
> For me xfs and ext4 always pass this test, i.e. the cat command fails
> with -EIO, both with and without your patch.
> 
> On the other hand, btrfs and f2fs always fail without your patch, and
> pass with your patch applied.

$ sudo ./check generic/730
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 v0 6.8.0-rc7+ #205 SMP PREEMPT_DYNAMIC Thu Mar 14 16:44:32 PDT 2024
MKFS_OPTIONS  -- -f /dev/mapper/tst-scr0
MOUNT_OPTIONS -- /dev/mapper/tst-scr0 /mnt/scratch

generic/730 1s ... [failed, exit status 1]- output mismatch (see /home/vmuser/fstests/results//generic/730.out.bad)
    --- tests/generic/730.out   2023-11-01 22:14:50.011000000 +0000
    +++ /home/vmuser/fstests/results//generic/730.out.bad       2024-03-16 14:38:02.431000000 +0000
    @@ -1,2 +1 @@
     QA output created by 730
    -cat: -: Input/output error
    ...
    (Run 'diff -u /home/vmuser/fstests/tests/generic/730.out /home/vmuser/fstests/results//generic/730.out.bad'  to see the entire diff)
Ran: generic/730
Failures: generic/730
Failed 1 of 1 tests

I suspect it has something to do with my Kconfig. Perhaps "xfs (non-debug)" is a clue?

> 
> From a btrfs point of view, it makes sense for the cat command to
> succeed, since all the file data is in the page cache.

That was my intuition as well, as I mentioned above.

> 
> I don't know why the cat command fails (-EIO) with xfs and ext4, and I
> haven't looked at any of their source code (my knowledge of those
> filesystem's internals is too little anyway).
> 
> An alternative to your patch is just to make the write with direct IO
> by passing -d to $XFS_IO_PROG. It makes the test succeed on these 4
> filesystems.
> 
> 

I like this idea a lot more than my patch!

> >
> > Summary of my test matrix:
> > ext4: passes with and without this patch
> > btrfs: fails without this patch, passes with it
> > xfs: fails with and without this patch
> >
> > For that reason, I don't think this a good patch, I just mostly want to
> > figure out where to go with this test!
> >
> > Thanks,
> > Boris
> >
> > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > ---
> > > >  tests/generic/730 | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/tests/generic/730 b/tests/generic/730
> > > > index 11308cdaa..ca5037c57 100755
> > > > --- a/tests/generic/730
> > > > +++ b/tests/generic/730
> > > > @@ -47,6 +47,9 @@ exec 3< $SCSI_DEBUG_MNT/testfile
> > > >  # delete the scsi debug device while it still has dirty data
> > > >  echo 1 > /sys/block/$(_short_dev $SCSI_DEBUG_DEV)/device/delete
> > > >
> > > > +sync
> > > > +echo 3 > /proc/sys/vm/drop_caches
> > > > +
> > > >  # try to read from the file, which should give us -EIO
> > > >  cat <&3 > /dev/null
> > > >
> > > > --
> > > > 2.43.0
> > > >
> > > >
> >
Filipe Manana March 16, 2024, 5:42 p.m. UTC | #5
On Sat, Mar 16, 2024 at 2:39 PM Boris Burkov <boris@bur.io> wrote:
>
> On Sat, Mar 16, 2024 at 01:08:57PM +0000, Filipe Manana wrote:
> > On Fri, Mar 15, 2024 at 11:30 PM Boris Burkov <boris@bur.io> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 07:56:28PM -0700, Darrick J. Wong wrote:
> > > > On Wed, Mar 13, 2024 at 04:47:47PM -0700, Boris Burkov wrote:
> > > > > This test removes a SCSI debug device out from under a mounted
> > > > > filesystem with a (probably) dirty file. This assumes that page cache
> > > > > cannot save us from EIO, for a reason that I can't quite explain. In
> > > > > fact, this test fails for exactly that reason, at least on btrfs.
> > > > >
> > > > > The original patches:
> > > > >
> > > > >       https://lore.kernel.org/fstests/20230807112100.GB15405@lst.de/
> > > > >
> > > > > refer to this passing on xfs and not btrfs, so I suspect I am missing
> > > > > something. With that said, on my machine this actually fails on xfs with
> > > > > and without my patch, so this is clearly not enough.
> > > > >
> > > > > High level, I am trying to understand what is really the expected
> > > > > behavior from a filesystem under this condition and what this test is
> > > > > getting at. Of btrfs, ext4, and xfs, only ext4 passes it, while btrfs
> > > > > does pass with this additional syncing/cache dropping to nudge it to an
> > > > > error.
> > > >
> > > > Does btrfs prefetch pagecache data as soon as a file opens?  Or I guess
> > > > it could be that xfs trips an IO error and shuts down, and
> > > > xfs_file_read_iter will return EIO after that happens.
> > > >
> > > > --D
> > > >
> > >
> > > I might be wildly misunderstanding, since I'm very far from an expert on
> > > the page cache, but didn't we just do a buffered write to the file? So it
> > > would make sense for it to be in cache?
> > >
> > > At any rate, what I observed on my system was that xfs does not pass
> > > this test. Curious if that is only on my system or just the current
> > > state of the test.
> >
> > Instead of xfs, do you mean btrfs?
> >
> > For me xfs and ext4 always pass this test, i.e. the cat command fails
> > with -EIO, both with and without your patch.
> >
> > On the other hand, btrfs and f2fs always fail without your patch, and
> > pass with your patch applied.
>
> $ sudo ./check generic/730
> FSTYP         -- xfs (non-debug)
> PLATFORM      -- Linux/x86_64 v0 6.8.0-rc7+ #205 SMP PREEMPT_DYNAMIC Thu Mar 14 16:44:32 PDT 2024
> MKFS_OPTIONS  -- -f /dev/mapper/tst-scr0
> MOUNT_OPTIONS -- /dev/mapper/tst-scr0 /mnt/scratch
>
> generic/730 1s ... [failed, exit status 1]- output mismatch (see /home/vmuser/fstests/results//generic/730.out.bad)
>     --- tests/generic/730.out   2023-11-01 22:14:50.011000000 +0000
>     +++ /home/vmuser/fstests/results//generic/730.out.bad       2024-03-16 14:38:02.431000000 +0000
>     @@ -1,2 +1 @@
>      QA output created by 730
>     -cat: -: Input/output error
>     ...
>     (Run 'diff -u /home/vmuser/fstests/tests/generic/730.out /home/vmuser/fstests/results//generic/730.out.bad'  to see the entire diff)
> Ran: generic/730
> Failures: generic/730
> Failed 1 of 1 tests
>
> I suspect it has something to do with my Kconfig. Perhaps "xfs (non-debug)" is a clue?

Mine is also non-debug and yet it always succeeds for me (without your
patch or any other of course):

root 17:40:02 /home/fdmanana/git/hub/xfstests > ./check generic/730
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 debian0 6.8.0-rc7-btrfs-next-153+ #1 SMP
PREEMPT_DYNAMIC Mon Mar  4 17:19:19 WET 2024
MKFS_OPTIONS  -- -f /dev/sdb
MOUNT_OPTIONS -- /dev/sdb /home/fdmanana/btrfs-tests/scratch_1

generic/730 2s ...  2s
Ran: generic/730
Passed all 1 tests


>
> >
> > From a btrfs point of view, it makes sense for the cat command to
> > succeed, since all the file data is in the page cache.
>
> That was my intuition as well, as I mentioned above.
>
> >
> > I don't know why the cat command fails (-EIO) with xfs and ext4, and I
> > haven't looked at any of their source code (my knowledge of those
> > filesystem's internals is too little anyway).
> >
> > An alternative to your patch is just to make the write with direct IO
> > by passing -d to $XFS_IO_PROG. It makes the test succeed on these 4
> > filesystems.
> >
> >
>
> I like this idea a lot more than my patch!
>
> > >
> > > Summary of my test matrix:
> > > ext4: passes with and without this patch
> > > btrfs: fails without this patch, passes with it
> > > xfs: fails with and without this patch
> > >
> > > For that reason, I don't think this a good patch, I just mostly want to
> > > figure out where to go with this test!
> > >
> > > Thanks,
> > > Boris
> > >
> > > > > Signed-off-by: Boris Burkov <boris@bur.io>
> > > > > ---
> > > > >  tests/generic/730 | 3 +++
> > > > >  1 file changed, 3 insertions(+)
> > > > >
> > > > > diff --git a/tests/generic/730 b/tests/generic/730
> > > > > index 11308cdaa..ca5037c57 100755
> > > > > --- a/tests/generic/730
> > > > > +++ b/tests/generic/730
> > > > > @@ -47,6 +47,9 @@ exec 3< $SCSI_DEBUG_MNT/testfile
> > > > >  # delete the scsi debug device while it still has dirty data
> > > > >  echo 1 > /sys/block/$(_short_dev $SCSI_DEBUG_DEV)/device/delete
> > > > >
> > > > > +sync
> > > > > +echo 3 > /proc/sys/vm/drop_caches
> > > > > +
> > > > >  # try to read from the file, which should give us -EIO
> > > > >  cat <&3 > /dev/null
> > > > >
> > > > > --
> > > > > 2.43.0
> > > > >
> > > > >
> > >
Christoph Hellwig March 17, 2024, 8:38 p.m. UTC | #6
On Sat, Mar 16, 2024 at 05:42:30PM +0000, Filipe Manana wrote:
> Mine is also non-debug and yet it always succeeds for me (without your
> patch or any other of course):

Same here.

And the idea behind the test is that when the device is removed the
file system should always return an error, i.e. have a check for shutdown
fs in the read path.
diff mbox series

Patch

diff --git a/tests/generic/730 b/tests/generic/730
index 11308cdaa..ca5037c57 100755
--- a/tests/generic/730
+++ b/tests/generic/730
@@ -47,6 +47,9 @@  exec 3< $SCSI_DEBUG_MNT/testfile
 # delete the scsi debug device while it still has dirty data
 echo 1 > /sys/block/$(_short_dev $SCSI_DEBUG_DEV)/device/delete
 
+sync
+echo 3 > /proc/sys/vm/drop_caches
+
 # try to read from the file, which should give us -EIO
 cat <&3 > /dev/null