
fs: clear writeback errors in inode_init_always

Message ID fa3b4dec3452aa5696b8657ef509891908901d95.camel@kernel.org (mailing list archive)
State New, archived

Commit Message

Jeff Layton May 20, 2018, 11:45 a.m. UTC
On Sat, 2018-05-19 at 19:19 -0400, Theodore Y. Ts'o wrote:
> On Sat, May 19, 2018 at 08:27:00AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > In inode_init_always(), we clear the inode mapping flags, which clears
> > any retained error (AS_EIO, AS_ENOSPC) bits.  Unfortunately, we do not
> > also clear wb_err, which means that old mapping errors can leak through
> > to new inodes.
> > 
> > This is crucial for the XFS inode allocation path because we recycle old
> > in-core inodes and we do not want error state from an old file to leak
> > into the new file.  This bug was discovered by running generic/036 and
> > generic/047 in a loop and noticing that the EIOs generated by the
> > collision of direct and buffered writes in generic/036 would survive the
> > remount between 036 and 047, and get reported to the fsyncs (on
> > different files on a reformatted fs!) in generic/047.
> > 
> > Since we're changing the semantics of inode_init_always, we must also
> > change xfs_reinit_inode to retain the writeback error state when we go
> > to recover an inode that has been torn down in the vfs but not yet
> > disposed of by XFS.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> This may fix the generic/047 failure, but alas, it does not address
> the shared/298 failure.
> 
> Jeff's theory that we may need to clear the errseq_t information after
> detaching the loop device makes sense to me; and in the case of the
> loop device, we wouldn't be initializing the inode, so your patch
> wouldn't do anything about that particular case.
> 
> 
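For reference, a minimal sketch of the idea in the patch quoted above
(drop any leftover mapping error state when an in-core inode is
reinitialized) could look like the helper below.  The helper name is
made up purely for illustration; the real change lives in
inode_init_always() and xfs_reinit_inode().

	#include <linux/fs.h>

	/*
	 * Illustrative only, not the actual patch: when an in-core inode
	 * is (re)initialized, discard error state left over from its
	 * previous life so it cannot leak into a new file.
	 */
	static void reset_mapping_error_state(struct address_space *mapping)
	{
		mapping->flags = 0;	/* clears AS_EIO / AS_ENOSPC */
		mapping->wb_err = 0;	/* clear the errseq_t cursor too */
	}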

I poked at this a bit this morning and found that if I ran generic/361
twice in a row, I'd see a failure similar to what you were seeing.
This patch seems to fix it for me.

Ted, could you test this against your reproducer? If this works for you
I'll plan to flesh out the patch description and get this into -next and
eventually to Linus before the next release.

Thanks,
Jeff

-------------------------8<--------------------

[PATCH] loop: clear wb_err in bd_inode when detaching backing file

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 drivers/block/loop.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Matthew Wilcox May 20, 2018, 12:58 p.m. UTC | #1
On Sun, May 20, 2018 at 07:45:31AM -0400, Jeff Layton wrote:
> [PATCH] loop: clear wb_err in bd_inode when detaching backing file

Is this the right thing to do?  Putting the test-suite aside for the
moment, if I have a loop device on a file and I hit a real error on the
storage backing that file, I don't see why detaching the loop device
from the file should clear the error.

I'm really tempted to say that we should fix the test-suite to consume
the error; once it's been read by at least one reader, it won't go back
to zero, but neither will it show up for new readers.
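For context, "consume the error" here refers to the errseq_t behaviour
behind wb_err.  A minimal sketch using the helpers from lib/errseq.c
(the names below are illustrative only, not taken from any caller):

	#include <linux/errno.h>
	#include <linux/errseq.h>

	static errseq_t wb_err;		/* stands in for mapping->wb_err */

	static void errseq_consume_demo(void)
	{
		/* cursor taken at open() time, before any error */
		errseq_t since = errseq_sample(&wb_err);

		/* a writeback error is recorded */
		errseq_set(&wb_err, -EIO);

		/*
		 * The first check reports the error and marks it "seen".
		 * The stored value never returns to zero, but later checks
		 * against the same cursor return 0, and new samplers start
		 * from the current value, so they never see the old error.
		 */
		errseq_check_and_advance(&wb_err, &since);	/* -EIO */
		errseq_check_and_advance(&wb_err, &since);	/* 0 */
	}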
Jeff Layton May 20, 2018, 1:18 p.m. UTC | #2
On Sun, 2018-05-20 at 05:58 -0700, Matthew Wilcox wrote:
> On Sun, May 20, 2018 at 07:45:31AM -0400, Jeff Layton wrote:
> > [PATCH] loop: clear wb_err in bd_inode when detaching backing file
> 
> Is this the right thing to do?  Putting the test-suite aside for the
> moment, if I have a loop device on a file and I hit a real error on the
> storage backing that file, I don't see why detaching the loop device
> from the file should clear the error.
> 
> I'm really tempted to say that we should fix the test-suite to consume
> the error; once it's been read by at least one reader, it won't go back
> to zero, but neither will it show up for new readers.

This is a bit of a grey area, for sure. FWIW, I looked at doing this in
invalidate_bdev, but backed off since I wasn't sure about the other
callers.

I still think it probably is the right thing to do. I'd think that
detaching the backing store is sort of an indicator that you really
don't care about it anymore (including any stored errors concerning it).

OTOH, my track record on predicting what applications care about is
pretty abysmal. I'll go with whatever the consensus is here...
Theodore Ts'o May 20, 2018, 4:29 p.m. UTC | #3
On Sun, May 20, 2018 at 05:58:43AM -0700, Matthew Wilcox wrote:
> On Sun, May 20, 2018 at 07:45:31AM -0400, Jeff Layton wrote:
> > [PATCH] loop: clear wb_err in bd_inode when detaching backing file
> 
> Is this the right thing to do?  Putting the test-suite aside for the
> moment, if I have a loop device on a file and I hit a real error on the
> storage backing that file, I don't see why detaching the loop device
> from the file should clear the error.
> 
> I'm really tempted to say that we should fix the test-suite to consume
> the error; once it's been read by at least one reader, it won't go back
> to zero, but neither will it show up for new readers.

You can't detach the backing store if there are any open file
descriptors (or if there is another loop device keeping the loop
device open, or if there is a file system mounted on it, etc.).

If you can detach the backing store, then once you detach it, the
loop device is *gone*.  The next user of /dev/loop0 will very likely
attach a completely different backing store, so returning an error
simply doesn't make any sense.  There is a very good chance it will be
a completely different backing store, with potentially a different
user, so returning a carried-over error is just going to confuse,
annoy, and anger the user.....

					- Ted
Theodore Ts'o May 20, 2018, 5:57 p.m. UTC | #4
On Sun, May 20, 2018 at 07:45:31AM -0400, Jeff Layton wrote:
> 
> I poked at this a bit this morning and found that if I ran generic/361
> twice in a row, I'd see a failure similar to what you were seeing.
> This patch seems to fix it for me.
> 
> Ted, could you test this against your reproducer? If this works for you
> I'll plan to flesh out the patch description and get this into -next and
> eventually to Linus before the next release.

Yup, this fixes shared/298 for me.   Many thanks!!

						- Ted
Matthew Wilcox May 20, 2018, 7:20 p.m. UTC | #5
On Sun, May 20, 2018 at 12:29:54PM -0400, Theodore Y. Ts'o wrote:
> On Sun, May 20, 2018 at 05:58:43AM -0700, Matthew Wilcox wrote:
> > On Sun, May 20, 2018 at 07:45:31AM -0400, Jeff Layton wrote:
> > > [PATCH] loop: clear wb_err in bd_inode when detaching backing file
> > 
> > Is this the right thing to do?  Putting the test-suite aside for the
> > moment, if I have a loop device on a file and I hit a real error on the
> > storage backing that file, I don't see why detaching the loop device
> > from the file should clear the error.
> > 
> > I'm really tempted to say that we should fix the test-suite to consume
> > the error; once it's been read by at least one reader, it won't go back
> > to zero, but neither will it show up for new readers.
> 
> You can't detach the backing store if there are any open file
> descriptors (or if there is another loop device keeping the loop
> device open, or if there is a file system mounted on it, etc.).
> 
> If you can detach the backing store, then once you detach it, the
> loop device is *gone*.  The next user of /dev/loop0 will very likely
> attach a completely different backing store, so returning an error
> simply doesn't make any sense.  There is a very good chance it will be
> a completely different backing store, with potentially a different
> user, so returning a carried-over error is just going to confuse,
> annoy, and anger the user.....

Oh!  I misunderstood.  I thought that bd_inode->i_mapping pointed to
lo_backing_file->f_mapping.

So how about this to preserve the error _in the file with the error_?

	if (bdev) {
		bdput(bdev);
		invalidate_bdev(bdev);
+		filp->f_mapping->wb_err = bdev->bd_inode->i_mapping->wb_err;
+		bdev->bd_inode->i_mapping->wb_err = 0;
	}
	set_capacity(lo->lo_disk, 0);
	loop_sysfs_exit(lo);
Theodore Ts'o May 20, 2018, 7:41 p.m. UTC | #6
On Sun, May 20, 2018 at 12:20:09PM -0700, Matthew Wilcox wrote:
> Oh!  I misunderstood.  I thought that bd_inode->i_mapping pointed to
> lo_backing_file->f_mapping.
> 
> So how about this to preserve the error _in the file with the error_?
> 
> 	if (bdev) {
> 		bdput(bdev);
> 		invalidate_bdev(bdev);
> +		filp->f_mapping->wb_err = bdev->bd_inode->i_mapping->wb_err;
> +		bdev->bd_inode->i_mapping->wb_err = 0;
> 	}
> 	set_capacity(lo->lo_disk, 0);
> 	loop_sysfs_exit(lo);

I don't think it's necessary.  wb_err on the underlying file would
already have been set when the error first occurred.  So if someone tried
calling fsync(2) on the underlying file, it would have collected the
error already.  Re-setting the error when we detach the loop device
doesn't seem to make any sense.

					- Ted
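In userspace terms, the sequence Ted describes looks roughly like the
following (the path and the failure itself are hypothetical, purely to
show where the error surfaces):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		/* the file later attached to a loop device */
		int fd = open("/var/tmp/backing.img", O_RDWR);

		if (fd < 0)
			return 1;

		/* ... writeback of this file hits an I/O error ... */

		/* the first fsync after the error collects it (e.g. EIO) */
		if (fsync(fd) < 0)
			perror("fsync");

		close(fd);
		return 0;
	}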
Jeff Layton May 21, 2018, 11:20 a.m. UTC | #7
On Sun, 2018-05-20 at 15:41 -0400, Theodore Y. Ts'o wrote:
> On Sun, May 20, 2018 at 12:20:09PM -0700, Matthew Wilcox wrote:
> > Oh!  I misunderstood.  I thought that bd_inode->i_mapping pointed to
> > lo_backing_file->f_mapping.
> > 
> > So how about this to preserve the error _in the file with the error_?
> > 
> > 	if (bdev) {
> > 		bdput(bdev);
> > 		invalidate_bdev(bdev);
> > +		filp->f_mapping->wb_err = bdev->bd_inode->i_mapping->wb_err;
> > +		bdev->bd_inode->i_mapping->wb_err = 0;
> > 	}
> > 	set_capacity(lo->lo_disk, 0);
> > 	loop_sysfs_exit(lo);
> 
> I don't think it's necessary.  wb_err on the underlying file would
> already have been set when the error first occurred.  So if someone tried
> calling fsync(2) on the underlying file, it would have collected the
> error already.  Re-setting the error when we detach the loop device
> doesn't seem to make any sense.
> 

(cc'ing linux-block)

Yes, the error would have been set on the original file already.  As
long as neither lo_req_flush nor __loop_update_dio was called before we
detach the device, it should still be possible to call fsync on the
original fd and get back the error.

As a side note, it looks like __loop_update_dio will discard an error
from vfs_fsync, so certain ioctls against a loop device can cause errors
to be lost. It seems like those ought to get propagated to userland or
to the blockdev's mapping somehow.
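One possible shape for that propagation, as a sketch only: assuming the
flush in question is the vfs_fsync() call in __loop_update_dio(), the
error could be recorded on the loop block device's mapping with the
existing errseq_set() helper instead of being dropped.  The function
below is hypothetical and untested, and assumes the usual
drivers/block/loop.c context:

	/*
	 * Sketch: capture the flush error rather than discarding it, and
	 * record it on the loop device's own mapping so a later fsync of
	 * the loop device can still report it.
	 */
	static void loop_flush_and_record_error(struct loop_device *lo)
	{
		int err = vfs_fsync(lo->lo_backing_file, 0);

		if (err && lo->lo_device)
			errseq_set(&lo->lo_device->bd_inode->i_mapping->wb_err,
				   err);
	}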
Theodore Ts'o May 21, 2018, 2:43 p.m. UTC | #8
On Mon, May 21, 2018 at 07:20:05AM -0400, Jeff Layton wrote:
> 
> As a side note, it looks like __loop_update_dio will discard an error
> from vfs_fsync, so certain ioctls against a loop device can cause errors
> to be lost. It seems like those ought to get propagated to userland or
> to the blockdev's mapping somehow.

I'm not sure it's worth it.  All of the ioctls that call
loop_update_dio are used just to configure the loop device.  In
practice they are never called once the loop device is actually
running.  It might be a better choice to just define that errors
get reset if there is any attempt to reconfigure the loop device.

One of these ioctls changes the offset that a loop device maps onto
the backing file.  So, for example, while offset 0 of the loop device
might have corresponded to the beginning of the backing file, after
executing the ioctl, offset 0 of the loop device now corresponds to
offset 2MB of the backing store.  This might happen if we are
resetting the loop device to point at the first partition of a disk
image, for example.  I really don't think that preserving the error
status across that kind of radical reconfiguration makes sense.

Or, for example, when we change the block size of the loop device: if
the underlying backing store is an IBM mainframe DASD with a 2k block
size, the error was signalled because there was an attempt to write a
1k block onto a 2k block device.  So again, resetting the error status
of the loop device is the right thing to do.

The final thing that's perhaps worth exploring is whether we really
need all of these knobs and ioctls.  In particular, LOOP_CHANGE_FD is
used by losetup.  It exists because back in prehistory, some
distribution installer needed it for some obscure corner case.  If I
remember correctly, it had to do with swapping out the initramfs
floppy disk for the root file system floppy disk.  We've wasted
developer bandwidth dealing with syzbot complaints about that ioctl,
and it's not clear the effort was worth it, because it's not clear any
real code actually _needs_ that ioctl.  It might have been easier to
comment out the ioctl and see if anyone screamed.

Given that the loop device is maintained mostly on the margins, and
it's not clear that lots of complicated error handling during setup is
really necessary, this is basically a plea to keep things simple, or
at least no more complicated than it has to be.

Cheers,

						- Ted

Patch

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5d4e31655d96..55cf554bc914 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1068,6 +1068,7 @@  static int loop_clr_fd(struct loop_device *lo)
 	if (bdev) {
 		bdput(bdev);
 		invalidate_bdev(bdev);
+		bdev->bd_inode->i_mapping->wb_err = 0;
 	}
 	set_capacity(lo->lo_disk, 0);
 	loop_sysfs_exit(lo);