diff mbox

fs: clear writeback errors in inode_init_always

Message ID 20180519152700.GB4507@magnolia (mailing list archive)
State New, archived
Headers show

Commit Message

Darrick J. Wong May 19, 2018, 3:27 p.m. UTC
From: Darrick J. Wong <darrick.wong@oracle.com>

In inode_init_always(), we clear the inode mapping flags, which clears
any retained error (AS_EIO, AS_ENOSC) bits.  Unfortunately, we do not
also clear wb_err, which means that old mapping errors can leak through
to new inodes.

This is crucial for the XFS inode allocation path because we recycle old
in-core inodes and we do not want error state from an old file to leak
into the new file.  This bug was discovered by running generic/036 and
generic/047 in a loop and noticing that the EIOs generated by the
collision of direct and buffered writes in generic/036 would survive the
remount between 036 and 047, and get reported to the fsyncs (on
different files on a reformatted fs!) in generic/047.

Since we're changing the semantics of inode_init_always, we must also
change xfs_reinit_inode to retain the writeback error state when we go
to recover an inode that has been torn down in the vfs but not yet
disposed of by XFS.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/inode.c          |    1 +
 fs/xfs/xfs_icache.c |    2 ++
 2 files changed, 3 insertions(+)

Comments

Matthew Wilcox (Oracle) May 19, 2018, 3:36 p.m. UTC | #1
On Sat, May 19, 2018 at 08:27:00AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> In inode_init_always(), we clear the inode mapping flags, which clears
> any retained error (AS_EIO, AS_ENOSC) bits.  Unfortunately, we do not

typo of ENOSPC in case you do a new version

> also clear wb_err, which means that old mapping errors can leak through
> to new inodes.
> 
> This is crucial for the XFS inode allocation path because we recycle old
> in-core inodes and we do not want error state from an old file to leak
> into the new file.  This bug was discovered by running generic/036 and
> generic/047 in a loop and noticing that the EIOs generated by the
> collision of direct and buffered writes in generic/036 would survive the
> remount between 036 and 047, and get reported to the fsyncs (on
> different files on a reformatted fs!) in generic/047.
> 
> Since we're changing the semantics of inode_init_always, we must also
> change xfs_reinit_inode to retain the writeback error state when we go
> to recover an inode that has been torn down in the vfs but not yet
> disposed of by XFS.

Don't you also need to preserve inode->i_mapping->flags across the
reinitialisation for the users which aren't yet using ->wb_err?

> +++ b/fs/xfs/xfs_icache.c
> @@ -298,6 +298,7 @@ xfs_reinit_inode(
>  	uint64_t	version = inode_peek_iversion(inode);
>  	umode_t		mode = inode->i_mode;
>  	dev_t		dev = inode->i_rdev;
> +	errseq_t	old_err = inode->i_mapping->wb_err;
>  
>  	error = inode_init_always(mp->m_super, inode);
>  
> @@ -306,6 +307,7 @@ xfs_reinit_inode(
>  	inode_set_iversion_queried(inode, version);
>  	inode->i_mode = mode;
>  	inode->i_rdev = dev;
> +	inode->i_mapping->wb_err = old_err;
>  	return error;
>  }
>
Theodore Ts'o May 19, 2018, 11:19 p.m. UTC | #2
On Sat, May 19, 2018 at 08:27:00AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> In inode_init_always(), we clear the inode mapping flags, which clears
> any retained error (AS_EIO, AS_ENOSC) bits.  Unfortunately, we do not
> also clear wb_err, which means that old mapping errors can leak through
> to new inodes.
> 
> This is crucial for the XFS inode allocation path because we recycle old
> in-core inodes and we do not want error state from an old file to leak
> into the new file.  This bug was discovered by running generic/036 and
> generic/047 in a loop and noticing that the EIOs generated by the
> collision of direct and buffered writes in generic/036 would survive the
> remount between 036 and 047, and get reported to the fsyncs (on
> different files on a reformatted fs!) in generic/047.
> 
> Since we're changing the semantics of inode_init_always, we must also
> change xfs_reinit_inode to retain the writeback error state when we go
> to recover an inode that has been torn down in the vfs but not yet
> disposed of by XFS.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

This may fix the generic/047 failure, but alas, it does not address
the shared/298 failure.

Jeff's theory that we may need to clear the errseq_t information after
detaching the loop device makes sense to me; and in the case of the
loop device, we wouldn't be initializing the inode, so your patch
wouldn't do anything about that particular case.

						- Ted
Darrick J. Wong May 21, 2018, 5:54 p.m. UTC | #3
On Sat, May 19, 2018 at 08:36:19AM -0700, Matthew Wilcox wrote:
> On Sat, May 19, 2018 at 08:27:00AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > In inode_init_always(), we clear the inode mapping flags, which clears
> > any retained error (AS_EIO, AS_ENOSC) bits.  Unfortunately, we do not
> 
> typo of ENOSPC in case you do a new version

Fixed, thanks.

> > also clear wb_err, which means that old mapping errors can leak through
> > to new inodes.
> > 
> > This is crucial for the XFS inode allocation path because we recycle old
> > in-core inodes and we do not want error state from an old file to leak
> > into the new file.  This bug was discovered by running generic/036 and
> > generic/047 in a loop and noticing that the EIOs generated by the
> > collision of direct and buffered writes in generic/036 would survive the
> > remount between 036 and 047, and get reported to the fsyncs (on
> > different files on a reformatted fs!) in generic/047.
> > 
> > Since we're changing the semantics of inode_init_always, we must also
> > change xfs_reinit_inode to retain the writeback error state when we go
> > to recover an inode that has been torn down in the vfs but not yet
> > disposed of by XFS.
> 
> Don't you also need to preserve inode->i_mapping->flags across the
> reinitialisation for the users which aren't yet using ->wb_err?

At first I thought (mistakenly) that xfs had been completely converted,
but digging further we still use the old filemap_check_errors.  It seems
reasonable to me that if we're going to resurrect an incore inode then
we should try to hang on to AS_EIO/AS_ENOSPC for as long as we can.

--D

> > +++ b/fs/xfs/xfs_icache.c
> > @@ -298,6 +298,7 @@ xfs_reinit_inode(
> >  	uint64_t	version = inode_peek_iversion(inode);
> >  	umode_t		mode = inode->i_mode;
> >  	dev_t		dev = inode->i_rdev;
> > +	errseq_t	old_err = inode->i_mapping->wb_err;
> >  
> >  	error = inode_init_always(mp->m_super, inode);
> >  
> > @@ -306,6 +307,7 @@ xfs_reinit_inode(
> >  	inode_set_iversion_queried(inode, version);
> >  	inode->i_mode = mode;
> >  	inode->i_rdev = dev;
> > +	inode->i_mapping->wb_err = old_err;
> >  	return error;
> >  }
> >  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jeff Layton May 22, 2018, 10:30 a.m. UTC | #4
On Mon, 2018-05-21 at 10:54 -0700, Darrick J. Wong wrote:
> On Sat, May 19, 2018 at 08:36:19AM -0700, Matthew Wilcox wrote:
> > On Sat, May 19, 2018 at 08:27:00AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > In inode_init_always(), we clear the inode mapping flags, which clears
> > > any retained error (AS_EIO, AS_ENOSC) bits.  Unfortunately, we do not
> > 
> > typo of ENOSPC in case you do a new version
> 
> Fixed, thanks.
> 
> > > also clear wb_err, which means that old mapping errors can leak through
> > > to new inodes.
> > > 
> > > This is crucial for the XFS inode allocation path because we recycle old
> > > in-core inodes and we do not want error state from an old file to leak
> > > into the new file.  This bug was discovered by running generic/036 and
> > > generic/047 in a loop and noticing that the EIOs generated by the
> > > collision of direct and buffered writes in generic/036 would survive the
> > > remount between 036 and 047, and get reported to the fsyncs (on
> > > different files on a reformatted fs!) in generic/047.
> > > 
> > > Since we're changing the semantics of inode_init_always, we must also
> > > change xfs_reinit_inode to retain the writeback error state when we go
> > > to recover an inode that has been torn down in the vfs but not yet
> > > disposed of by XFS.
> > 
> > Don't you also need to preserve inode->i_mapping->flags across the
> > reinitialisation for the users which aren't yet using ->wb_err?
> 
> At first I thought (mistakenly) that xfs had been completely converted,
> but digging further we still use the old filemap_check_errors.  It seems
> reasonable to me that if we're going to resurrect an incore inode then
> we should try to hang on to AS_EIO/AS_ENOSPC for as long as we can.
> 
> 

Yes, I think the patch you sent makes sense.

Most of the XFS callers are using filemap_write_and_wait{_range}, and
most of those seem to be called from ioctls (->setattr op being the
notable exception).

We could (in principle) pass a pointer to file->f_wb_err to most of
those callers, and use that to check for errors instead of looking for
AS_EIO/AS_ENOSPC. We wouldn't want to advance the error cursor for those
(as that should only be done in the context of an fsync), but it might
be more reliable than using the flags here.

> > > +++ b/fs/xfs/xfs_icache.c
> > > @@ -298,6 +298,7 @@ xfs_reinit_inode(
> > >  	uint64_t	version = inode_peek_iversion(inode);
> > >  	umode_t		mode = inode->i_mode;
> > >  	dev_t		dev = inode->i_rdev;
> > > +	errseq_t	old_err = inode->i_mapping->wb_err;
> > >  
> > >  	error = inode_init_always(mp->m_super, inode);
> > >  
> > > @@ -306,6 +307,7 @@ xfs_reinit_inode(
> > >  	inode_set_iversion_queried(inode, version);
> > >  	inode->i_mode = mode;
> > >  	inode->i_rdev = dev;
> > > +	inode->i_mapping->wb_err = old_err;
> > >  	return error;
> > >  }
> > >  
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner May 22, 2018, 10:09 p.m. UTC | #5
On Tue, May 22, 2018 at 06:30:40AM -0400, Jeff Layton wrote:
> On Mon, 2018-05-21 at 10:54 -0700, Darrick J. Wong wrote:
> > On Sat, May 19, 2018 at 08:36:19AM -0700, Matthew Wilcox wrote:
> > > On Sat, May 19, 2018 at 08:27:00AM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > In inode_init_always(), we clear the inode mapping flags, which clears
> > > > any retained error (AS_EIO, AS_ENOSC) bits.  Unfortunately, we do not
> > > 
> > > typo of ENOSPC in case you do a new version
> > 
> > Fixed, thanks.
> > 
> > > > also clear wb_err, which means that old mapping errors can leak through
> > > > to new inodes.
> > > > 
> > > > This is crucial for the XFS inode allocation path because we recycle old
> > > > in-core inodes and we do not want error state from an old file to leak
> > > > into the new file.  This bug was discovered by running generic/036 and
> > > > generic/047 in a loop and noticing that the EIOs generated by the
> > > > collision of direct and buffered writes in generic/036 would survive the
> > > > remount between 036 and 047, and get reported to the fsyncs (on
> > > > different files on a reformatted fs!) in generic/047.
> > > > 
> > > > Since we're changing the semantics of inode_init_always, we must also
> > > > change xfs_reinit_inode to retain the writeback error state when we go
> > > > to recover an inode that has been torn down in the vfs but not yet
> > > > disposed of by XFS.
> > > 
> > > Don't you also need to preserve inode->i_mapping->flags across the
> > > reinitialisation for the users which aren't yet using ->wb_err?
> > 
> > At first I thought (mistakenly) that xfs had been completely converted,
> > but digging further we still use the old filemap_check_errors.  It seems
> > reasonable to me that if we're going to resurrect an incore inode then
> > we should try to hang on to AS_EIO/AS_ENOSPC for as long as we can.
> > 
> > 
> 
> Yes, I think the patch you sent makes sense.
> 
> Most of the XFS callers are using filemap_write_and_wait{_range}, and
> most of those seem to be called from ioctls (->setattr op being the
> notable exception).
> 
> We could (in principle) pass a pointer to file->f_wb_err to most of
> those callers, and use that to check for errors instead of looking for
> AS_EIO/AS_ENOSPC. We wouldn't want to advance the error cursor for those
> (as that should only be done in the context of an fsync), but it might
> be more reliable than using the flags here.

Why wouldn't we advance the error pointer? The data writeback error
caused an operation to fail and the error has been reported to
userspace, so how is that different to fsync() failing and reporting
the error to userspace?

This seems like a slippery slope of inconsistency to me, where data
writeback errors are only reported once on fsync(), but can be
reported multiple times through different filesystem operations...

Cheers,

Dave.
Jeff Layton May 23, 2018, 10:56 a.m. UTC | #6
On Wed, 2018-05-23 at 08:09 +1000, Dave Chinner wrote:
> On Tue, May 22, 2018 at 06:30:40AM -0400, Jeff Layton wrote:
> > On Mon, 2018-05-21 at 10:54 -0700, Darrick J. Wong wrote:
> > > On Sat, May 19, 2018 at 08:36:19AM -0700, Matthew Wilcox wrote:
> > > > On Sat, May 19, 2018 at 08:27:00AM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > In inode_init_always(), we clear the inode mapping flags, which clears
> > > > > any retained error (AS_EIO, AS_ENOSC) bits.  Unfortunately, we do not
> > > > 
> > > > typo of ENOSPC in case you do a new version
> > > 
> > > Fixed, thanks.
> > > 
> > > > > also clear wb_err, which means that old mapping errors can leak through
> > > > > to new inodes.
> > > > > 
> > > > > This is crucial for the XFS inode allocation path because we recycle old
> > > > > in-core inodes and we do not want error state from an old file to leak
> > > > > into the new file.  This bug was discovered by running generic/036 and
> > > > > generic/047 in a loop and noticing that the EIOs generated by the
> > > > > collision of direct and buffered writes in generic/036 would survive the
> > > > > remount between 036 and 047, and get reported to the fsyncs (on
> > > > > different files on a reformatted fs!) in generic/047.
> > > > > 
> > > > > Since we're changing the semantics of inode_init_always, we must also
> > > > > change xfs_reinit_inode to retain the writeback error state when we go
> > > > > to recover an inode that has been torn down in the vfs but not yet
> > > > > disposed of by XFS.
> > > > 
> > > > Don't you also need to preserve inode->i_mapping->flags across the
> > > > reinitialisation for the users which aren't yet using ->wb_err?
> > > 
> > > At first I thought (mistakenly) that xfs had been completely converted,
> > > but digging further we still use the old filemap_check_errors.  It seems
> > > reasonable to me that if we're going to resurrect an incore inode then
> > > we should try to hang on to AS_EIO/AS_ENOSPC for as long as we can.
> > > 
> > > 
> > 
> > Yes, I think the patch you sent makes sense.
> > 
> > Most of the XFS callers are using filemap_write_and_wait{_range}, and
> > most of those seem to be called from ioctls (->setattr op being the
> > notable exception).
> > 
> > We could (in principle) pass a pointer to file->f_wb_err to most of
> > those callers, and use that to check for errors instead of looking for
> > AS_EIO/AS_ENOSPC. We wouldn't want to advance the error cursor for those
> > (as that should only be done in the context of an fsync), but it might
> > be more reliable than using the flags here.
> 
> Why wouldn't we advance the error pointer? The data writeback error
> caused an operation to fail and the error has been reported to
> userspace, so how is that different to fsync() failing and reporting
> the error to userspace?
> 
> This seems like a slippery slope of inconsistency to me, where data
> writeback errors are only reported once on fsync(), but can be
> reported multiple times through different filesystem operations...
> 

If I get back an error on ioctl, why should I think that that has
anything to do with writeback? For example, suppose I do:

write = 0
ioctl = -EIO
fsync = 0

The ioctl returns an error but the fsync returns 0. It would be
reasonable to assume that my writes had made it to disk.

I think that we really do want to ensure that fsync always returns an
error if writeback failed since the last fsync. That was sort of the
point of the errseq_t work in the first place.
Dave Chinner May 24, 2018, 3:59 a.m. UTC | #7
On Wed, May 23, 2018 at 06:56:00AM -0400, Jeff Layton wrote:
> On Wed, 2018-05-23 at 08:09 +1000, Dave Chinner wrote:
> > On Tue, May 22, 2018 at 06:30:40AM -0400, Jeff Layton wrote:
> > > On Mon, 2018-05-21 at 10:54 -0700, Darrick J. Wong wrote:
> > > > On Sat, May 19, 2018 at 08:36:19AM -0700, Matthew Wilcox wrote:
> > > > > On Sat, May 19, 2018 at 08:27:00AM -0700, Darrick J. Wong wrote:
> > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > 
> > > > > > In inode_init_always(), we clear the inode mapping flags, which clears
> > > > > > any retained error (AS_EIO, AS_ENOSC) bits.  Unfortunately, we do not
> > > > > 
> > > > > typo of ENOSPC in case you do a new version
> > > > 
> > > > Fixed, thanks.
> > > > 
> > > > > > also clear wb_err, which means that old mapping errors can leak through
> > > > > > to new inodes.
> > > > > > 
> > > > > > This is crucial for the XFS inode allocation path because we recycle old
> > > > > > in-core inodes and we do not want error state from an old file to leak
> > > > > > into the new file.  This bug was discovered by running generic/036 and
> > > > > > generic/047 in a loop and noticing that the EIOs generated by the
> > > > > > collision of direct and buffered writes in generic/036 would survive the
> > > > > > remount between 036 and 047, and get reported to the fsyncs (on
> > > > > > different files on a reformatted fs!) in generic/047.
> > > > > > 
> > > > > > Since we're changing the semantics of inode_init_always, we must also
> > > > > > change xfs_reinit_inode to retain the writeback error state when we go
> > > > > > to recover an inode that has been torn down in the vfs but not yet
> > > > > > disposed of by XFS.
> > > > > 
> > > > > Don't you also need to preserve inode->i_mapping->flags across the
> > > > > reinitialisation for the users which aren't yet using ->wb_err?
> > > > 
> > > > At first I thought (mistakenly) that xfs had been completely converted,
> > > > but digging further we still use the old filemap_check_errors.  It seems
> > > > reasonable to me that if we're going to resurrect an incore inode then
> > > > we should try to hang on to AS_EIO/AS_ENOSPC for as long as we can.
> > > > 
> > > > 
> > > 
> > > Yes, I think the patch you sent makes sense.
> > > 
> > > Most of the XFS callers are using filemap_write_and_wait{_range}, and
> > > most of those seem to be called from ioctls (->setattr op being the
> > > notable exception).
> > > 
> > > We could (in principle) pass a pointer to file->f_wb_err to most of
> > > those callers, and use that to check for errors instead of looking for
> > > AS_EIO/AS_ENOSPC. We wouldn't want to advance the error cursor for those
> > > (as that should only be done in the context of an fsync), but it might
> > > be more reliable than using the flags here.
> > 
> > Why wouldn't we advance the error pointer? The data writeback error
> > caused an operation to fail and the error has been reported to
> > userspace, so how is that different to fsync() failing and reporting
> > the error to userspace?
> > 
> > This seems like a slippery slope of inconsistency to me, where data
> > writeback errors are only reported once on fsync(), but can be
> > reported multiple times through different filesystem operations...
> > 
> 
> If I get back an error on ioctl, why should I think that that has
> anything to do with writeback? For example, suppose I do:
> 
> write = 0
> ioctl = -EIO
> fsync = 0
> 
> The ioctl returns an error but the fsync returns 0. It would be
> reasonable to assume that my writes had made it to disk.
> 
> I think that we really do want to ensure that fsync always returns an
> error if writeback failed since the last fsync. That was sort of the
> point of the errseq_t work in the first place.

This needs documentation explaining all these intricacies. Because
I'm certain that we'll get it wrong again if the expected behaviour
isn't all spelled out very clearly.

Cheers,

Dave.
diff mbox

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 13ceb98c3bd3..3b55391072f3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -178,6 +178,7 @@  int inode_init_always(struct super_block *sb, struct inode *inode)
 	mapping->a_ops = &empty_aops;
 	mapping->host = inode;
 	mapping->flags = 0;
+	mapping->wb_err = 0;
 	atomic_set(&mapping->i_mmap_writable, 0);
 	mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
 	mapping->private_data = NULL;
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 164350d91efc..f4d672cb60a4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -298,6 +298,7 @@  xfs_reinit_inode(
 	uint64_t	version = inode_peek_iversion(inode);
 	umode_t		mode = inode->i_mode;
 	dev_t		dev = inode->i_rdev;
+	errseq_t	old_err = inode->i_mapping->wb_err;
 
 	error = inode_init_always(mp->m_super, inode);
 
@@ -306,6 +307,7 @@  xfs_reinit_inode(
 	inode_set_iversion_queried(inode, version);
 	inode->i_mode = mode;
 	inode->i_rdev = dev;
+	inode->i_mapping->wb_err = old_err;
 	return error;
 }