diff mbox

[v4,15/27] fs: retrofit old error reporting API onto new infrastructure

Message ID 20170509154930.29524-16-jlayton@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jeff Layton May 9, 2017, 3:49 p.m. UTC
Now that we have a better way to store and report errors that occur
during writeback, we need to convert the existing codebase to use it. We
could just adapt all of the filesystem code and related infrastructure
to the new API, but that's a lot of churn.

When it comes to setting errors in the mapping, filemap_set_wb_error is
a drop-in replacement for mapping_set_error. Turn that function into a
simple wrapper around the new one.

Because we want to ensure that writeback errors are always reported at
fsync time, inject filemap_report_wb_error calls much closer to the
syscall boundary, in call_fsync.

For fsync calls (and things like the nfsd equivalent), we either return
the error that the fsync operation returns, or the one returned by
filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
the latest value. This allows us to provide new fsync semantics that
will return errors that may have occurred previously and been viewed
via other file descriptors.

The final piece of the puzzle is what to do about filemap_check_errors
calls that are being called directly or via filemap_* functions. Here,
we must take a little "creative license".

Since we now handle advancing the file->f_wb_err value at the generic
filesystem layer, we no longer need those callers to clear errors out
of the mapping or advance an errseq_t.

A lot of the existing codebase relies on being getting an error back
from those functions when there is a writeback problem, so we do still
want to have them report writeback errors somehow.

When reporting writeback errors, we will always report errors that have
occurred since a particular point in time. With the old writeback error
reporting, the time we used was "since it was last tested/cleared" which
is entirely arbitrary and potentially racy. Now, we can at least report
the latest error that has occurred since an arbitrary point in time
(represented as a sampled errseq_t value).

In the case where we don't have a struct file to work with, this patch
just has the wrappers sample the current mapping->wb_err value, and use
that as an arbitrary point from which to check for errors.

That's probably not "correct" in all cases, particularly in the case of
something like filemap_fdatawait, but I'm not sure it's any worse than
what we already have, and this gives us a basis from which to work.

A lot of those callers will likely want to change to a model where they
sample the errseq_t much earlier (perhaps when starting a transaction),
store it in an appropriate place and then use that value later when
checking to see if an error occurred.

That will almost certainly take some involvement from other subsystem
maintainers. I'm quite open to adding new API functions to help enable
this if that would be helpful, but I don't really want to do that until
I better understand what's needed.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 Documentation/filesystems/vfs.txt |  9 +++----
 fs/btrfs/file.c                   | 10 ++------
 fs/btrfs/tree-log.c               |  9 ++-----
 fs/f2fs/file.c                    |  3 +++
 fs/f2fs/node.c                    |  6 +----
 fs/fuse/file.c                    |  7 ++++--
 fs/libfs.c                        |  6 +++--
 include/linux/fs.h                | 19 +++++++++------
 include/linux/pagemap.h           |  8 ++----
 mm/filemap.c                      | 51 +++++++++++++++++----------------------
 10 files changed, 57 insertions(+), 71 deletions(-)

Comments

Jan Kara May 15, 2017, 10:42 a.m. UTC | #1
On Tue 09-05-17 11:49:18, Jeff Layton wrote:
> Now that we have a better way to store and report errors that occur
> during writeback, we need to convert the existing codebase to use it. We
> could just adapt all of the filesystem code and related infrastructure
> to the new API, but that's a lot of churn.
> 
> When it comes to setting errors in the mapping, filemap_set_wb_error is
> a drop-in replacement for mapping_set_error. Turn that function into a
> simple wrapper around the new one.
> 
> Because we want to ensure that writeback errors are always reported at
> fsync time, inject filemap_report_wb_error calls much closer to the
> syscall boundary, in call_fsync.
> 
> For fsync calls (and things like the nfsd equivalent), we either return
> the error that the fsync operation returns, or the one returned by
> filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
> the latest value. This allows us to provide new fsync semantics that
> will return errors that may have occurred previously and been viewed
> via other file descriptors.
> 
> The final piece of the puzzle is what to do about filemap_check_errors
> calls that are being called directly or via filemap_* functions. Here,
> we must take a little "creative license".
> 
> Since we now handle advancing the file->f_wb_err value at the generic
> filesystem layer, we no longer need those callers to clear errors out
> of the mapping or advance an errseq_t.
> 
> A lot of the existing codebase relies on being getting an error back
> from those functions when there is a writeback problem, so we do still
> want to have them report writeback errors somehow.
> 
> When reporting writeback errors, we will always report errors that have
> occurred since a particular point in time. With the old writeback error
> reporting, the time we used was "since it was last tested/cleared" which
> is entirely arbitrary and potentially racy. Now, we can at least report
> the latest error that has occurred since an arbitrary point in time
> (represented as a sampled errseq_t value).
> 
> In the case where we don't have a struct file to work with, this patch
> just has the wrappers sample the current mapping->wb_err value, and use
> that as an arbitrary point from which to check for errors.

I think this is really dangerous and we shouldn't do this. You are quite
likely to lose IO errors in such calls because you will ignore all errors
that happened during previous background writeback or even for IO that
managed to complete before we called filemap_fdatawait(). Maybe we need to
keep the original set-clear-bit IO error reporting for these cases, until
we can convert them to fdatawait_range_since()?

> That's probably not "correct" in all cases, particularly in the case of
> something like filemap_fdatawait, but I'm not sure it's any worse than
> what we already have, and this gives us a basis from which to work.
> 
> A lot of those callers will likely want to change to a model where they
> sample the errseq_t much earlier (perhaps when starting a transaction),
> store it in an appropriate place and then use that value later when
> checking to see if an error occurred.
> 
> That will almost certainly take some involvement from other subsystem
> maintainers. I'm quite open to adding new API functions to help enable
> this if that would be helpful, but I don't really want to do that until
> I better understand what's needed.
> 
> Signed-off-by: Jeff Layton <jlayton@redhat.com>

...

> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index 5f7317875a67..7ce13281925f 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
>  		.nr_to_write = LONG_MAX,
>  		.for_reclaim = 0,
>  	};
> +	errseq_t since = READ_ONCE(file->f_wb_err);
>  
>  	if (unlikely(f2fs_readonly(inode->i_sb)))
>  		return 0;
> @@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
>  	}
>  
>  	ret = wait_on_node_pages_writeback(sbi, ino);
> +	if (ret == 0)
> +		ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
>  	if (ret)
>  		goto out;

So this conversion looks wrong and actually points to a larger issue with
the scheme. The problem is there are two mappings that come into play here
- file_inode(file)->i_mapping which is the data mapping and
NODE_MAPPING(sbi) which is the metadata mapping (and this is not a problem
specific to f2fs. For example ext2 also uses this scheme where block
devices' mapping is the metadata mapping). And we need to merge error
information from these two mappings so for the stamping scheme to work,
we'd need two stamps stored in struct file. One for data mapping and one
for metadata mapping. Or maybe there's some more clever scheme but for now
I don't see one...

								Honza
Jeff Layton May 15, 2017, 5:58 p.m. UTC | #2
On Mon, 2017-05-15 at 12:42 +0200, Jan Kara wrote:
> On Tue 09-05-17 11:49:18, Jeff Layton wrote:
> > Now that we have a better way to store and report errors that occur
> > during writeback, we need to convert the existing codebase to use it. We
> > could just adapt all of the filesystem code and related infrastructure
> > to the new API, but that's a lot of churn.
> > 
> > When it comes to setting errors in the mapping, filemap_set_wb_error is
> > a drop-in replacement for mapping_set_error. Turn that function into a
> > simple wrapper around the new one.
> > 
> > Because we want to ensure that writeback errors are always reported at
> > fsync time, inject filemap_report_wb_error calls much closer to the
> > syscall boundary, in call_fsync.
> > 
> > For fsync calls (and things like the nfsd equivalent), we either return
> > the error that the fsync operation returns, or the one returned by
> > filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
> > the latest value. This allows us to provide new fsync semantics that
> > will return errors that may have occurred previously and been viewed
> > via other file descriptors.
> > 
> > The final piece of the puzzle is what to do about filemap_check_errors
> > calls that are being called directly or via filemap_* functions. Here,
> > we must take a little "creative license".
> > 
> > Since we now handle advancing the file->f_wb_err value at the generic
> > filesystem layer, we no longer need those callers to clear errors out
> > of the mapping or advance an errseq_t.
> > 
> > A lot of the existing codebase relies on being getting an error back
> > from those functions when there is a writeback problem, so we do still
> > want to have them report writeback errors somehow.
> > 
> > When reporting writeback errors, we will always report errors that have
> > occurred since a particular point in time. With the old writeback error
> > reporting, the time we used was "since it was last tested/cleared" which
> > is entirely arbitrary and potentially racy. Now, we can at least report
> > the latest error that has occurred since an arbitrary point in time
> > (represented as a sampled errseq_t value).
> > 
> > In the case where we don't have a struct file to work with, this patch
> > just has the wrappers sample the current mapping->wb_err value, and use
> > that as an arbitrary point from which to check for errors.
> 
> I think this is really dangerous and we shouldn't do this. You are quite
> likely to lose IO errors in such calls because you will ignore all errors
> that happened during previous background writeback or even for IO that
> managed to complete before we called filemap_fdatawait(). Maybe we need to
> keep the original set-clear-bit IO error reporting for these cases, until
> we can convert them to fdatawait_range_since()?
> 

Yes that is a danger.

In fact, late last week I was finally able to make my xfstest work with
btrfs, and started hitting oopses with it because of the way the error
reporting changed. I rolled up a patch to fix that (and it simplifies
the code some, IMO), but other callers of filemap_fdatawait may have
similar problems here.

I'll pick up working on this again in a more piecemeal way. I had
originally looked at doing that, but there are some problematic places
(e.g. buffer.c), where the code is shared by a bunch of different
filesystems that makes it difficult.

We may end up needing some sort of FS_WB_ERRSEQ flag to do this
correctly, at least until the transition to errseq_t is done.

> > That's probably not "correct" in all cases, particularly in the case of
> > something like filemap_fdatawait, but I'm not sure it's any worse than
> > what we already have, and this gives us a basis from which to work.
> > 
> > A lot of those callers will likely want to change to a model where they
> > sample the errseq_t much earlier (perhaps when starting a transaction),
> > store it in an appropriate place and then use that value later when
> > checking to see if an error occurred.
> > 
> > That will almost certainly take some involvement from other subsystem
> > maintainers. I'm quite open to adding new API functions to help enable
> > this if that would be helpful, but I don't really want to do that until
> > I better understand what's needed.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> 
> ...
> 
> > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> > index 5f7317875a67..7ce13281925f 100644
> > --- a/fs/f2fs/file.c
> > +++ b/fs/f2fs/file.c
> > @@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> >  		.nr_to_write = LONG_MAX,
> >  		.for_reclaim = 0,
> >  	};
> > +	errseq_t since = READ_ONCE(file->f_wb_err);
> >  
> >  	if (unlikely(f2fs_readonly(inode->i_sb)))
> >  		return 0;
> > @@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> >  	}
> >  
> >  	ret = wait_on_node_pages_writeback(sbi, ino);
> > +	if (ret == 0)
> > +		ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
> >  	if (ret)
> >  		goto out;
> 
> So this conversion looks wrong and actually points to a larger issue with
> the scheme. The problem is there are two mappings that come into play here
> - file_inode(file)->i_mapping which is the data mapping and
> NODE_MAPPING(sbi) which is the metadata mapping (and this is not a problem
> specific to f2fs. For example ext2 also uses this scheme where block
> devices' mapping is the metadata mapping). And we need to merge error
> information from these two mappings so for the stamping scheme to work,
> we'd need two stamps stored in struct file. One for data mapping and one
> for metadata mapping. Or maybe there's some more clever scheme but for now
> I don't see one...
> 

Got it -- since there are two mappings, there are two errseq_t's and
you'd need a since cursor for each. I don't really like having to add a
second 32-bit word to struct file, but I also don't see a real
alternative right offhand. May be able to stash it in file->private_data 
for some of these filesystems.

In any case, I'll ponder how to do this in a more piecemeal way.

Thanks for all of the review so far!
Jeff Layton May 19, 2017, 7:20 p.m. UTC | #3
On Mon, 2017-05-15 at 12:42 +0200, Jan Kara wrote:
> On Tue 09-05-17 11:49:18, Jeff Layton wrote:
> > Now that we have a better way to store and report errors that occur
> > during writeback, we need to convert the existing codebase to use it. We
> > could just adapt all of the filesystem code and related infrastructure
> > to the new API, but that's a lot of churn.
> > 
> > When it comes to setting errors in the mapping, filemap_set_wb_error is
> > a drop-in replacement for mapping_set_error. Turn that function into a
> > simple wrapper around the new one.
> > 
> > Because we want to ensure that writeback errors are always reported at
> > fsync time, inject filemap_report_wb_error calls much closer to the
> > syscall boundary, in call_fsync.
> > 
> > For fsync calls (and things like the nfsd equivalent), we either return
> > the error that the fsync operation returns, or the one returned by
> > filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
> > the latest value. This allows us to provide new fsync semantics that
> > will return errors that may have occurred previously and been viewed
> > via other file descriptors.
> > 
> > The final piece of the puzzle is what to do about filemap_check_errors
> > calls that are being called directly or via filemap_* functions. Here,
> > we must take a little "creative license".
> > 
> > Since we now handle advancing the file->f_wb_err value at the generic
> > filesystem layer, we no longer need those callers to clear errors out
> > of the mapping or advance an errseq_t.
> > 
> > A lot of the existing codebase relies on being getting an error back
> > from those functions when there is a writeback problem, so we do still
> > want to have them report writeback errors somehow.
> > 
> > When reporting writeback errors, we will always report errors that have
> > occurred since a particular point in time. With the old writeback error
> > reporting, the time we used was "since it was last tested/cleared" which
> > is entirely arbitrary and potentially racy. Now, we can at least report
> > the latest error that has occurred since an arbitrary point in time
> > (represented as a sampled errseq_t value).
> > 
> > In the case where we don't have a struct file to work with, this patch
> > just has the wrappers sample the current mapping->wb_err value, and use
> > that as an arbitrary point from which to check for errors.
> 
> I think this is really dangerous and we shouldn't do this. You are quite
> likely to lose IO errors in such calls because you will ignore all errors
> that happened during previous background writeback or even for IO that
> managed to complete before we called filemap_fdatawait(). Maybe we need to
> keep the original set-clear-bit IO error reporting for these cases, until
> we can convert them to fdatawait_range_since()?
> 
> > That's probably not "correct" in all cases, particularly in the case of
> > something like filemap_fdatawait, but I'm not sure it's any worse than
> > what we already have, and this gives us a basis from which to work.
> > 
> > A lot of those callers will likely want to change to a model where they
> > sample the errseq_t much earlier (perhaps when starting a transaction),
> > store it in an appropriate place and then use that value later when
> > checking to see if an error occurred.
> > 
> > That will almost certainly take some involvement from other subsystem
> > maintainers. I'm quite open to adding new API functions to help enable
> > this if that would be helpful, but I don't really want to do that until
> > I better understand what's needed.
> > 
> > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> 
> ...
> 
> > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> > index 5f7317875a67..7ce13281925f 100644
> > --- a/fs/f2fs/file.c
> > +++ b/fs/f2fs/file.c
> > @@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> >  		.nr_to_write = LONG_MAX,
> >  		.for_reclaim = 0,
> >  	};
> > +	errseq_t since = READ_ONCE(file->f_wb_err);
> >  
> >  	if (unlikely(f2fs_readonly(inode->i_sb)))
> >  		return 0;
> > @@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> >  	}
> >  
> >  	ret = wait_on_node_pages_writeback(sbi, ino);
> > +	if (ret == 0)
> > +		ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
> >  	if (ret)
> >  		goto out;
> 
> So this conversion looks wrong and actually points to a larger issue with
> the scheme. The problem is there are two mappings that come into play here
> - file_inode(file)->i_mapping which is the data mapping and
> NODE_MAPPING(sbi) which is the metadata mapping (and this is not a problem
> specific to f2fs. For example ext2 also uses this scheme where block
> devices' mapping is the metadata mapping). And we need to merge error
> information from these two mappings so for the stamping scheme to work,
> we'd need two stamps stored in struct file. One for data mapping and one
> for metadata mapping. Or maybe there's some more clever scheme but for now
> I don't see one...
> 
> 								Honza

In the case of something like ext2, could we instead get away with just
marking the data mapping of the inode with an error if the metadata
writeout fails?

Then we could just have write_inode operations call mapping_set_error on
inode->i_mapping when they're going to return an error. That should be
functionally equivalent, I'd think.

The catch there is that that requires a 1:1 data:metadata mapping, and
I'm not sure that that is the case (or will always be, even if it is
now).
Jan Kara May 22, 2017, 1:38 p.m. UTC | #4
On Fri 19-05-17 15:20:52, Jeff Layton wrote:
> On Mon, 2017-05-15 at 12:42 +0200, Jan Kara wrote:
> > On Tue 09-05-17 11:49:18, Jeff Layton wrote:
> > > Now that we have a better way to store and report errors that occur
> > > during writeback, we need to convert the existing codebase to use it. We
> > > could just adapt all of the filesystem code and related infrastructure
> > > to the new API, but that's a lot of churn.
> > > 
> > > When it comes to setting errors in the mapping, filemap_set_wb_error is
> > > a drop-in replacement for mapping_set_error. Turn that function into a
> > > simple wrapper around the new one.
> > > 
> > > Because we want to ensure that writeback errors are always reported at
> > > fsync time, inject filemap_report_wb_error calls much closer to the
> > > syscall boundary, in call_fsync.
> > > 
> > > For fsync calls (and things like the nfsd equivalent), we either return
> > > the error that the fsync operation returns, or the one returned by
> > > filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
> > > the latest value. This allows us to provide new fsync semantics that
> > > will return errors that may have occurred previously and been viewed
> > > via other file descriptors.
> > > 
> > > The final piece of the puzzle is what to do about filemap_check_errors
> > > calls that are being called directly or via filemap_* functions. Here,
> > > we must take a little "creative license".
> > > 
> > > Since we now handle advancing the file->f_wb_err value at the generic
> > > filesystem layer, we no longer need those callers to clear errors out
> > > of the mapping or advance an errseq_t.
> > > 
> > > A lot of the existing codebase relies on being getting an error back
> > > from those functions when there is a writeback problem, so we do still
> > > want to have them report writeback errors somehow.
> > > 
> > > When reporting writeback errors, we will always report errors that have
> > > occurred since a particular point in time. With the old writeback error
> > > reporting, the time we used was "since it was last tested/cleared" which
> > > is entirely arbitrary and potentially racy. Now, we can at least report
> > > the latest error that has occurred since an arbitrary point in time
> > > (represented as a sampled errseq_t value).
> > > 
> > > In the case where we don't have a struct file to work with, this patch
> > > just has the wrappers sample the current mapping->wb_err value, and use
> > > that as an arbitrary point from which to check for errors.
> > 
> > I think this is really dangerous and we shouldn't do this. You are quite
> > likely to lose IO errors in such calls because you will ignore all errors
> > that happened during previous background writeback or even for IO that
> > managed to complete before we called filemap_fdatawait(). Maybe we need to
> > keep the original set-clear-bit IO error reporting for these cases, until
> > we can convert them to fdatawait_range_since()?
> > 
> > > That's probably not "correct" in all cases, particularly in the case of
> > > something like filemap_fdatawait, but I'm not sure it's any worse than
> > > what we already have, and this gives us a basis from which to work.
> > > 
> > > A lot of those callers will likely want to change to a model where they
> > > sample the errseq_t much earlier (perhaps when starting a transaction),
> > > store it in an appropriate place and then use that value later when
> > > checking to see if an error occurred.
> > > 
> > > That will almost certainly take some involvement from other subsystem
> > > maintainers. I'm quite open to adding new API functions to help enable
> > > this if that would be helpful, but I don't really want to do that until
> > > I better understand what's needed.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > 
> > ...
> > 
> > > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> > > index 5f7317875a67..7ce13281925f 100644
> > > --- a/fs/f2fs/file.c
> > > +++ b/fs/f2fs/file.c
> > > @@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> > >  		.nr_to_write = LONG_MAX,
> > >  		.for_reclaim = 0,
> > >  	};
> > > +	errseq_t since = READ_ONCE(file->f_wb_err);
> > >  
> > >  	if (unlikely(f2fs_readonly(inode->i_sb)))
> > >  		return 0;
> > > @@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> > >  	}
> > >  
> > >  	ret = wait_on_node_pages_writeback(sbi, ino);
> > > +	if (ret == 0)
> > > +		ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
> > >  	if (ret)
> > >  		goto out;
> > 
> > So this conversion looks wrong and actually points to a larger issue with
> > the scheme. The problem is there are two mappings that come into play here
> > - file_inode(file)->i_mapping which is the data mapping and
> > NODE_MAPPING(sbi) which is the metadata mapping (and this is not a problem
> > specific to f2fs. For example ext2 also uses this scheme where block
> > devices' mapping is the metadata mapping). And we need to merge error
> > information from these two mappings so for the stamping scheme to work,
> > we'd need two stamps stored in struct file. One for data mapping and one
> > for metadata mapping. Or maybe there's some more clever scheme but for now
> > I don't see one...
> > 
> > 								Honza
> 
> In the case of something like ext2, could we instead get away with just
> marking the data mapping of the inode with an error if the metadata
> writeout fails?
> 
> Then we could just have write_inode operations call mapping_set_error on
> inode->i_mapping when they're going to return an error. That should be
> functionally equivalent, I'd think.
> 
> The catch there is that that requires a 1:1 data:metadata mapping, and
> I'm not sure that that is the case (or will always be, even if it is
> now).

So for ext2 / ext4 in nojournal mode this should work - we track all
relevant metadata in mapping->private_list. But I cannot really comment
on other filesystems like f2fs...

								Honza
Jeff Layton May 22, 2017, 1:53 p.m. UTC | #5
On Mon, 2017-05-22 at 15:38 +0200, Jan Kara wrote:
> On Fri 19-05-17 15:20:52, Jeff Layton wrote:
> > On Mon, 2017-05-15 at 12:42 +0200, Jan Kara wrote:
> > > On Tue 09-05-17 11:49:18, Jeff Layton wrote:
> > > > Now that we have a better way to store and report errors that occur
> > > > during writeback, we need to convert the existing codebase to use it. We
> > > > could just adapt all of the filesystem code and related infrastructure
> > > > to the new API, but that's a lot of churn.
> > > > 
> > > > When it comes to setting errors in the mapping, filemap_set_wb_error is
> > > > a drop-in replacement for mapping_set_error. Turn that function into a
> > > > simple wrapper around the new one.
> > > > 
> > > > Because we want to ensure that writeback errors are always reported at
> > > > fsync time, inject filemap_report_wb_error calls much closer to the
> > > > syscall boundary, in call_fsync.
> > > > 
> > > > For fsync calls (and things like the nfsd equivalent), we either return
> > > > the error that the fsync operation returns, or the one returned by
> > > > filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
> > > > the latest value. This allows us to provide new fsync semantics that
> > > > will return errors that may have occurred previously and been viewed
> > > > via other file descriptors.
> > > > 
> > > > The final piece of the puzzle is what to do about filemap_check_errors
> > > > calls that are being called directly or via filemap_* functions. Here,
> > > > we must take a little "creative license".
> > > > 
> > > > Since we now handle advancing the file->f_wb_err value at the generic
> > > > filesystem layer, we no longer need those callers to clear errors out
> > > > of the mapping or advance an errseq_t.
> > > > 
> > > > A lot of the existing codebase relies on being getting an error back
> > > > from those functions when there is a writeback problem, so we do still
> > > > want to have them report writeback errors somehow.
> > > > 
> > > > When reporting writeback errors, we will always report errors that have
> > > > occurred since a particular point in time. With the old writeback error
> > > > reporting, the time we used was "since it was last tested/cleared" which
> > > > is entirely arbitrary and potentially racy. Now, we can at least report
> > > > the latest error that has occurred since an arbitrary point in time
> > > > (represented as a sampled errseq_t value).
> > > > 
> > > > In the case where we don't have a struct file to work with, this patch
> > > > just has the wrappers sample the current mapping->wb_err value, and use
> > > > that as an arbitrary point from which to check for errors.
> > > 
> > > I think this is really dangerous and we shouldn't do this. You are quite
> > > likely to lose IO errors in such calls because you will ignore all errors
> > > that happened during previous background writeback or even for IO that
> > > managed to complete before we called filemap_fdatawait(). Maybe we need to
> > > keep the original set-clear-bit IO error reporting for these cases, until
> > > we can convert them to fdatawait_range_since()?
> > > 
> > > > That's probably not "correct" in all cases, particularly in the case of
> > > > something like filemap_fdatawait, but I'm not sure it's any worse than
> > > > what we already have, and this gives us a basis from which to work.
> > > > 
> > > > A lot of those callers will likely want to change to a model where they
> > > > sample the errseq_t much earlier (perhaps when starting a transaction),
> > > > store it in an appropriate place and then use that value later when
> > > > checking to see if an error occurred.
> > > > 
> > > > That will almost certainly take some involvement from other subsystem
> > > > maintainers. I'm quite open to adding new API functions to help enable
> > > > this if that would be helpful, but I don't really want to do that until
> > > > I better understand what's needed.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > 
> > > ...
> > > 
> > > > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> > > > index 5f7317875a67..7ce13281925f 100644
> > > > --- a/fs/f2fs/file.c
> > > > +++ b/fs/f2fs/file.c
> > > > @@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> > > >  		.nr_to_write = LONG_MAX,
> > > >  		.for_reclaim = 0,
> > > >  	};
> > > > +	errseq_t since = READ_ONCE(file->f_wb_err);
> > > >  
> > > >  	if (unlikely(f2fs_readonly(inode->i_sb)))
> > > >  		return 0;
> > > > @@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> > > >  	}
> > > >  
> > > >  	ret = wait_on_node_pages_writeback(sbi, ino);
> > > > +	if (ret == 0)
> > > > +		ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
> > > >  	if (ret)
> > > >  		goto out;
> > > 
> > > So this conversion looks wrong and actually points to a larger issue with
> > > the scheme. The problem is there are two mappings that come into play here
> > > - file_inode(file)->i_mapping which is the data mapping and
> > > NODE_MAPPING(sbi) which is the metadata mapping (and this is not a problem
> > > specific to f2fs. For example ext2 also uses this scheme where block
> > > devices' mapping is the metadata mapping). And we need to merge error
> > > information from these two mappings so for the stamping scheme to work,
> > > we'd need two stamps stored in struct file. One for data mapping and one
> > > for metadata mapping. Or maybe there's some more clever scheme but for now
> > > I don't see one...
> > > 
> > > 								Honza
> > 
> > In the case of something like ext2, could we instead get away with just
> > marking the data mapping of the inode with an error if the metadata
> > writeout fails?
> > 
> > Then we could just have write_inode operations call mapping_set_error on
> > inode->i_mapping when they're going to return an error. That should be
> > functionally equivalent, I'd think.
> > 
> > The catch there is that that requires a 1:1 data:metadata mapping, and
> > I'm not sure that that is the case (or will always be, even if it is
> > now).
> 
> So for ext2 / ext4 in nojournal mode this should work - we track all
> relevant metadata in mapping->private_list. But I cannot really comment
> on other filesystems like f2fs...
> 

Actually, I think that may be problematic...

We could end up calling ext2_write_inode with sync_mode != WB_SYNC_ALL,
which just dirties the buffer without starting writeback. Then, have VM
subsystem write back the buffer due to memory pressure and have that
fail. Trying to set the error in write_inode would miss that situation.

I think we'll have to track two errseq_t's for those filesystems, like
you had originally suggested. The main question is where to put the
second errseq_t

For ext2, I stored it in file->private_data pointer since it wasn't
being used. For those filesystems that are already using private_data,
I'll just plan to add a field to whatever structure they're using.

(Maybe I could make private_data a union with an errseq_t for those that
don't use that field? That might be cleaner than the casting I'm doing
now)
Jan Kara May 22, 2017, 5:53 p.m. UTC | #6
On Mon 22-05-17 09:53:21, Jeff Layton wrote:
> On Mon, 2017-05-22 at 15:38 +0200, Jan Kara wrote:
> > On Fri 19-05-17 15:20:52, Jeff Layton wrote:
> > > On Mon, 2017-05-15 at 12:42 +0200, Jan Kara wrote:
> > > > On Tue 09-05-17 11:49:18, Jeff Layton wrote:
> > > > > Now that we have a better way to store and report errors that occur
> > > > > during writeback, we need to convert the existing codebase to use it. We
> > > > > could just adapt all of the filesystem code and related infrastructure
> > > > > to the new API, but that's a lot of churn.
> > > > > 
> > > > > When it comes to setting errors in the mapping, filemap_set_wb_error is
> > > > > a drop-in replacement for mapping_set_error. Turn that function into a
> > > > > simple wrapper around the new one.
> > > > > 
> > > > > Because we want to ensure that writeback errors are always reported at
> > > > > fsync time, inject filemap_report_wb_error calls much closer to the
> > > > > syscall boundary, in call_fsync.
> > > > > 
> > > > > For fsync calls (and things like the nfsd equivalent), we either return
> > > > > the error that the fsync operation returns, or the one returned by
> > > > > filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
> > > > > the latest value. This allows us to provide new fsync semantics that
> > > > > will return errors that may have occurred previously and been viewed
> > > > > via other file descriptors.
> > > > > 
> > > > > The final piece of the puzzle is what to do about filemap_check_errors
> > > > > calls that are being called directly or via filemap_* functions. Here,
> > > > > we must take a little "creative license".
> > > > > 
> > > > > Since we now handle advancing the file->f_wb_err value at the generic
> > > > > filesystem layer, we no longer need those callers to clear errors out
> > > > > of the mapping or advance an errseq_t.
> > > > > 
> > > > > A lot of the existing codebase relies on being getting an error back
> > > > > from those functions when there is a writeback problem, so we do still
> > > > > want to have them report writeback errors somehow.
> > > > > 
> > > > > When reporting writeback errors, we will always report errors that have
> > > > > occurred since a particular point in time. With the old writeback error
> > > > > reporting, the time we used was "since it was last tested/cleared" which
> > > > > is entirely arbitrary and potentially racy. Now, we can at least report
> > > > > the latest error that has occurred since an arbitrary point in time
> > > > > (represented as a sampled errseq_t value).
> > > > > 
> > > > > In the case where we don't have a struct file to work with, this patch
> > > > > just has the wrappers sample the current mapping->wb_err value, and use
> > > > > that as an arbitrary point from which to check for errors.
> > > > 
> > > > I think this is really dangerous and we shouldn't do this. You are quite
> > > > likely to lose IO errors in such calls because you will ignore all errors
> > > > that happened during previous background writeback or even for IO that
> > > > managed to complete before we called filemap_fdatawait(). Maybe we need to
> > > > keep the original set-clear-bit IO error reporting for these cases, until
> > > > we can convert them to fdatawait_range_since()?
> > > > 
> > > > > That's probably not "correct" in all cases, particularly in the case of
> > > > > something like filemap_fdatawait, but I'm not sure it's any worse than
> > > > > what we already have, and this gives us a basis from which to work.
> > > > > 
> > > > > A lot of those callers will likely want to change to a model where they
> > > > > sample the errseq_t much earlier (perhaps when starting a transaction),
> > > > > store it in an appropriate place and then use that value later when
> > > > > checking to see if an error occurred.
> > > > > 
> > > > > That will almost certainly take some involvement from other subsystem
> > > > > maintainers. I'm quite open to adding new API functions to help enable
> > > > > this if that would be helpful, but I don't really want to do that until
> > > > > I better understand what's needed.
> > > > > 
> > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > 
> > > > ...
> > > > 
> > > > > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> > > > > index 5f7317875a67..7ce13281925f 100644
> > > > > --- a/fs/f2fs/file.c
> > > > > +++ b/fs/f2fs/file.c
> > > > > @@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> > > > >  		.nr_to_write = LONG_MAX,
> > > > >  		.for_reclaim = 0,
> > > > >  	};
> > > > > +	errseq_t since = READ_ONCE(file->f_wb_err);
> > > > >  
> > > > >  	if (unlikely(f2fs_readonly(inode->i_sb)))
> > > > >  		return 0;
> > > > > @@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> > > > >  	}
> > > > >  
> > > > >  	ret = wait_on_node_pages_writeback(sbi, ino);
> > > > > +	if (ret == 0)
> > > > > +		ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
> > > > >  	if (ret)
> > > > >  		goto out;
> > > > 
> > > > So this conversion looks wrong and actually points to a larger issue with
> > > > the scheme. The problem is there are two mappings that come into play here
> > > > - file_inode(file)->i_mapping which is the data mapping and
> > > > NODE_MAPPING(sbi) which is the metadata mapping (and this is not a problem
> > > > specific to f2fs. For example ext2 also uses this scheme where block
> > > > devices' mapping is the metadata mapping). And we need to merge error
> > > > information from these two mappings so for the stamping scheme to work,
> > > > we'd need two stamps stored in struct file. One for data mapping and one
> > > > for metadata mapping. Or maybe there's some more clever scheme but for now
> > > > I don't see one...
> > > > 
> > > > 								Honza
> > > 
> > > In the case of something like ext2, could we instead get away with just
> > > marking the data mapping of the inode with an error if the metadata
> > > writeout fails?
> > > 
> > > Then we could just have write_inode operations call mapping_set_error on
> > > inode->i_mapping when they're going to return an error. That should be
> > > functionally equivalent, I'd think.
> > > 
> > > The catch there is that that requires a 1:1 data:metadata mapping, and
> > > I'm not sure that that is the case (or will always be, even if it is
> > > now).
> > 
> > So for ext2 / ext4 in nojournal mode this should work - we track all
> > relevant metadata in mapping->private_list. But I cannot really comment
> > on other filesystems like f2fs...
> > 
> 
> Actually, I think that may be problematic...
> 
> We could end up calling ext2_write_inode with sync_mode != WB_SYNC_ALL,
> which just dirties the buffer without starting writeback. Then, have VM
> subsystem write back the buffer due to memory pressure and have that
> fail. Trying to set the error in write_inode would miss that situation.

Two notes here:

1) Inode is a bad example because there isn't 1:1 mapping between buffers
containing inodes and mappings - one buffer contains several inodes.
I wanted to add that for inodes specifically it does not matter as they get
special handling but actually fsync seems to be currently unreliable for
them - if we first wrote them in WB_SYNC_NONE mode, they will be just
written in bdev's page cache, but following fsync(2) will do nothing as
they will be clean. Anyway, this is unrelated problem.

2) For metadata like indirect blocks where you indeed have 1:1 mapping, you
can do the error setting in ->end_io handler based on bh->b_assoc_map and
that should do what you need, shouldn't it?

If I'm indeed right, then for buffers which have 1:1 mapping we are fine
and if we find a solution for inodes, we could avoid the second errseq_t.

								Honza
Jeff Layton May 22, 2017, 7:09 p.m. UTC | #7
On Mon, 2017-05-22 at 19:53 +0200, Jan Kara wrote:
> On Mon 22-05-17 09:53:21, Jeff Layton wrote:
> > On Mon, 2017-05-22 at 15:38 +0200, Jan Kara wrote:
> > > On Fri 19-05-17 15:20:52, Jeff Layton wrote:
> > > > On Mon, 2017-05-15 at 12:42 +0200, Jan Kara wrote:
> > > > > On Tue 09-05-17 11:49:18, Jeff Layton wrote:
> > > > > > Now that we have a better way to store and report errors that occur
> > > > > > during writeback, we need to convert the existing codebase to use it. We
> > > > > > could just adapt all of the filesystem code and related infrastructure
> > > > > > to the new API, but that's a lot of churn.
> > > > > > 
> > > > > > When it comes to setting errors in the mapping, filemap_set_wb_error is
> > > > > > a drop-in replacement for mapping_set_error. Turn that function into a
> > > > > > simple wrapper around the new one.
> > > > > > 
> > > > > > Because we want to ensure that writeback errors are always reported at
> > > > > > fsync time, inject filemap_report_wb_error calls much closer to the
> > > > > > syscall boundary, in call_fsync.
> > > > > > 
> > > > > > For fsync calls (and things like the nfsd equivalent), we either return
> > > > > > the error that the fsync operation returns, or the one returned by
> > > > > > filemap_report_wb_error. In both cases, we advance the file->f_wb_err to
> > > > > > the latest value. This allows us to provide new fsync semantics that
> > > > > > will return errors that may have occurred previously and been viewed
> > > > > > via other file descriptors.
> > > > > > 
> > > > > > The final piece of the puzzle is what to do about filemap_check_errors
> > > > > > calls that are being called directly or via filemap_* functions. Here,
> > > > > > we must take a little "creative license".
> > > > > > 
> > > > > > Since we now handle advancing the file->f_wb_err value at the generic
> > > > > > filesystem layer, we no longer need those callers to clear errors out
> > > > > > of the mapping or advance an errseq_t.
> > > > > > 
> > > > > > A lot of the existing codebase relies on being getting an error back
> > > > > > from those functions when there is a writeback problem, so we do still
> > > > > > want to have them report writeback errors somehow.
> > > > > > 
> > > > > > When reporting writeback errors, we will always report errors that have
> > > > > > occurred since a particular point in time. With the old writeback error
> > > > > > reporting, the time we used was "since it was last tested/cleared" which
> > > > > > is entirely arbitrary and potentially racy. Now, we can at least report
> > > > > > the latest error that has occurred since an arbitrary point in time
> > > > > > (represented as a sampled errseq_t value).
> > > > > > 
> > > > > > In the case where we don't have a struct file to work with, this patch
> > > > > > just has the wrappers sample the current mapping->wb_err value, and use
> > > > > > that as an arbitrary point from which to check for errors.
> > > > > 
> > > > > I think this is really dangerous and we shouldn't do this. You are quite
> > > > > likely to lose IO errors in such calls because you will ignore all errors
> > > > > that happened during previous background writeback or even for IO that
> > > > > managed to complete before we called filemap_fdatawait(). Maybe we need to
> > > > > keep the original set-clear-bit IO error reporting for these cases, until
> > > > > we can convert them to fdatawait_range_since()?
> > > > > 
> > > > > > That's probably not "correct" in all cases, particularly in the case of
> > > > > > something like filemap_fdatawait, but I'm not sure it's any worse than
> > > > > > what we already have, and this gives us a basis from which to work.
> > > > > > 
> > > > > > A lot of those callers will likely want to change to a model where they
> > > > > > sample the errseq_t much earlier (perhaps when starting a transaction),
> > > > > > store it in an appropriate place and then use that value later when
> > > > > > checking to see if an error occurred.
> > > > > > 
> > > > > > That will almost certainly take some involvement from other subsystem
> > > > > > maintainers. I'm quite open to adding new API functions to help enable
> > > > > > this if that would be helpful, but I don't really want to do that until
> > > > > > I better understand what's needed.
> > > > > > 
> > > > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > > 
> > > > > ...
> > > > > 
> > > > > > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> > > > > > index 5f7317875a67..7ce13281925f 100644
> > > > > > --- a/fs/f2fs/file.c
> > > > > > +++ b/fs/f2fs/file.c
> > > > > > @@ -187,6 +187,7 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> > > > > >  		.nr_to_write = LONG_MAX,
> > > > > >  		.for_reclaim = 0,
> > > > > >  	};
> > > > > > +	errseq_t since = READ_ONCE(file->f_wb_err);
> > > > > >  
> > > > > >  	if (unlikely(f2fs_readonly(inode->i_sb)))
> > > > > >  		return 0;
> > > > > > @@ -265,6 +266,8 @@ static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
> > > > > >  	}
> > > > > >  
> > > > > >  	ret = wait_on_node_pages_writeback(sbi, ino);
> > > > > > +	if (ret == 0)
> > > > > > +		ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
> > > > > >  	if (ret)
> > > > > >  		goto out;
> > > > > 
> > > > > So this conversion looks wrong and actually points to a larger issue with
> > > > > the scheme. The problem is there are two mappings that come into play here
> > > > > - file_inode(file)->i_mapping which is the data mapping and
> > > > > NODE_MAPPING(sbi) which is the metadata mapping (and this is not a problem
> > > > > specific to f2fs. For example ext2 also uses this scheme where block
> > > > > devices' mapping is the metadata mapping). And we need to merge error
> > > > > information from these two mappings so for the stamping scheme to work,
> > > > > we'd need two stamps stored in struct file. One for data mapping and one
> > > > > for metadata mapping. Or maybe there's some more clever scheme but for now
> > > > > I don't see one...
> > > > > 
> > > > > 								Honza
> > > > 
> > > > In the case of something like ext2, could we instead get away with just
> > > > marking the data mapping of the inode with an error if the metadata
> > > > writeout fails?
> > > > 
> > > > Then we could just have write_inode operations call mapping_set_error on
> > > > inode->i_mapping when they're going to return an error. That should be
> > > > functionally equivalent, I'd think.
> > > > 
> > > > The catch there is that that requires a 1:1 data:metadata mapping, and
> > > > I'm not sure that that is the case (or will always be, even if it is
> > > > now).
> > > 
> > > So for ext2 / ext4 in nojournal mode this should work - we track all
> > > relevant metadata in mapping->private_list. But I cannot really comment
> > > on other filesystems like f2fs...
> > > 
> > 
> > Actually, I think that may be problematic...
> > 
> > We could end up calling ext2_write_inode with sync_mode != WB_SYNC_ALL,
> > which just dirties the buffer without starting writeback. Then, have VM
> > subsystem write back the buffer due to memory pressure and have that
> > fail. Trying to set the error in write_inode would miss that situation.
> 
> Two notes here:
> 
> 1) Inode is a bad example because there isn't 1:1 mapping between buffers
> containing inodes and mappings - one buffer contains several inodes.
> I wanted to add that for inodes specifically it does not matter as they get
> special handling but actually fsync seems to be currently unreliable for
> them - if we first wrote them in WB_SYNC_NONE mode, they will be just
> written in bdev's page cache, but following fsync(2) will do nothing as
> they will be clean. Anyway, this is unrelated problem.
> 

Yes, that's what I was trying to articulate above. I'm not sure it's
unrelated though. Moving to errseq_t based handling there based on the
blockdev mapping seems like it'd solve that. That does require an extra
errseq_t though.

(I assume that on ext2 inode writeback, bh->b_page->mapping->host points
to the bdev inode?)

> 2) For metadata like indirect blocks where you indeed have 1:1 mapping, you
> can do the error setting in ->end_io handler based on bh->b_assoc_map and
> that should do what you need, shouldn't it?
> 

That would probably work, and I think the mark_buffer_write_io_error
function that I was adding should already be doing the right thing
there.

> If I'm indeed right, then for buffers which have 1:1 mapping we are fine
> and if we find a solution for inodes, we could avoid the second errseq_t.
> 

Yeah, I'm just still not seeing a good way to track error in inode
metadata writeback without an extra errseq_t though. I don't suppose
that a buffer holding inode metadata has a list of those inodes, does
it? Then we could walk the list and flag each one with the error.
Without something like that, I think we're stuck with an extra errseq_t.

BTW, thanks for the help so far. I haven't spent much time in local fs
metadata handling up until now, so this has been very helpful.
Jan Kara May 23, 2017, 9:05 a.m. UTC | #8
On Mon 22-05-17 15:09:33, Jeff Layton wrote:
> On Mon, 2017-05-22 at 19:53 +0200, Jan Kara wrote:
> > On Mon 22-05-17 09:53:21, Jeff Layton wrote:
> > > On Mon, 2017-05-22 at 15:38 +0200, Jan Kara wrote:
> > > > > In the case of something like ext2, could we instead get away with just
> > > > > marking the data mapping of the inode with an error if the metadata
> > > > > writeout fails?
> > > > > 
> > > > > Then we could just have write_inode operations call mapping_set_error on
> > > > > inode->i_mapping when they're going to return an error. That should be
> > > > > functionally equivalent, I'd think.
> > > > > 
> > > > > The catch there is that that requires a 1:1 data:metadata mapping, and
> > > > > I'm not sure that that is the case (or will always be, even if it is
> > > > > now).
> > > > 
> > > > So for ext2 / ext4 in nojournal mode this should work - we track all
> > > > relevant metadata in mapping->private_list. But I cannot really comment
> > > > on other filesystems like f2fs...
> > > > 
> > > 
> > > Actually, I think that may be problematic...
> > > 
> > > We could end up calling ext2_write_inode with sync_mode != WB_SYNC_ALL,
> > > which just dirties the buffer without starting writeback. Then, have VM
> > > subsystem write back the buffer due to memory pressure and have that
> > > fail. Trying to set the error in write_inode would miss that situation.
> > 
> > Two notes here:
> > 
> > 1) Inode is a bad example because there isn't 1:1 mapping between buffers
> > containing inodes and mappings - one buffer contains several inodes.
> > I wanted to add that for inodes specifically it does not matter as they get
> > special handling but actually fsync seems to be currently unreliable for
> > them - if we first wrote them in WB_SYNC_NONE mode, they will be just
> > written in bdev's page cache, but following fsync(2) will do nothing as
> > they will be clean. Anyway, this is unrelated problem.
> > 
> 
> Yes, that's what I was trying to articulate above. I'm not sure it's
> unrelated though. Moving to errseq_t based handling there based on the
> blockdev mapping seems like it'd solve that. That does require an extra
> errseq_t though.

Well, it might help solving the error handling case but it doesn't solve
the fundamental problem that the inode buffer even doesn't have to be
written to disk by the time fsync(2) returns.

> (I assume that on ext2 inode writeback, bh->b_page->mapping->host points
> to the bdev inode?)

Yes, it does.

> > 2) For metadata like indirect blocks where you indeed have 1:1 mapping, you
> > can do the error setting in ->end_io handler based on bh->b_assoc_map and
> > that should do what you need, shouldn't it?
> > 
> 
> That would probably work, and I think the mark_buffer_write_io_error
> function that I was adding should already be doing the right thing
> there.

Agreed.

> > If I'm indeed right, then for buffers which have 1:1 mapping we are fine
> > and if we find a solution for inodes, we could avoid the second errseq_t.
> 
> Yeah, I'm just still not seeing a good way to track error in inode
> metadata writeback without an extra errseq_t though. I don't suppose
> that a buffer holding inode metadata has a list of those inodes, does
> it? Then we could walk the list and flag each one with the error.
> Without something like that, I think we're stuck with an extra errseq_t.

No, the buffer doesn't have a list of associated inodes. For ext2/4 it is
doable to actually track down all the inodes but I don't think we want to
complicate this series by implementing such mechanism for each filesystem
that needs this. So let's start with a generic solution that uses second
errseq_t for the metadata mapping. It is somewhat rough (error in writeback
of any metadata block will fail fsync(2) for all open files) but we can later
improve on this for each fs which cares enough about better error reporting.

								Honza
diff mbox

Patch

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index ed06fb39822b..f201a77873f7 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -577,7 +577,7 @@  written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
 If there is an error during writeback, then the address_space should be
-marked with an error (typically using filemap_set_wb_error), in order to
+marked with an error (typically using mapping_set_error), in order to
 ensure that the error can later be reported to the application when an
 fsync is issued.
 
@@ -893,10 +893,9 @@  otherwise noted.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call. Filesystems that use the
-	pagecache should call filemap_report_wb_error before returning
-	to ensure that any errors that occurred during writeback are
-	reported and the file's error sequence advanced.
+  fsync: called by the fsync(2) system call. Errors that were previously
+	 recorded using mapping_set_error will automatically be returned to
+	 the application and the file's error sequence advanced.
 
   fasync: called by the fcntl(2) system call when asynchronous
 	(non-blocking) mode is enabled for a file
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 520cb7230b2d..e15faf240b51 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1962,6 +1962,7 @@  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	int ret = 0;
 	bool full_sync = 0;
 	u64 len;
+	errseq_t wb_since = READ_ONCE(file->f_wb_err);
 
 	/*
 	 * The range length can be represented by u64, we have to do the typecasts
@@ -2079,14 +2080,7 @@  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		 */
 		clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
 			  &BTRFS_I(inode)->runtime_flags);
-		/*
-		 * An ordered extent might have started before and completed
-		 * already with io errors, in which case the inode was not
-		 * updated and we end up here. So check the inode's mapping
-		 * flags for any errors that might have happened while doing
-		 * writeback of file data.
-		 */
-		ret = filemap_check_errors(inode->i_mapping);
+		ret = filemap_check_wb_error(inode->i_mapping, wb_since);
 		inode_unlock(inode);
 		goto out;
 	}
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index a59674c3e69e..d0a123dbb199 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3972,12 +3972,6 @@  static int wait_ordered_extents(struct btrfs_trans_handle *trans,
 			    test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)));
 
 		if (test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)) {
-			/*
-			 * Clear the AS_EIO/AS_ENOSPC flags from the inode's
-			 * i_mapping flags, so that the next fsync won't get
-			 * an outdated io error too.
-			 */
-			filemap_check_errors(inode->i_mapping);
 			*ordered_io_error = true;
 			break;
 		}
@@ -4171,6 +4165,7 @@  static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 	u64 test_gen;
 	int ret = 0;
 	int num = 0;
+	errseq_t since = filemap_sample_wb_error(inode->vfs_inode.i_mapping);
 
 	INIT_LIST_HEAD(&extents);
 
@@ -4214,7 +4209,7 @@  static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 	 * without writing to the log tree and the fsync must report the
 	 * file data write error and not commit the current transaction.
 	 */
-	ret = filemap_check_errors(inode->vfs_inode.i_mapping);
+	ret = filemap_check_wb_error(inode->vfs_inode.i_mapping, since);
 	if (ret)
 		ctx->io_err = ret;
 process:
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 5f7317875a67..7ce13281925f 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -187,6 +187,7 @@  static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
 		.nr_to_write = LONG_MAX,
 		.for_reclaim = 0,
 	};
+	errseq_t since = READ_ONCE(file->f_wb_err);
 
 	if (unlikely(f2fs_readonly(inode->i_sb)))
 		return 0;
@@ -265,6 +266,8 @@  static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
 	}
 
 	ret = wait_on_node_pages_writeback(sbi, ino);
+	if (ret == 0)
+		ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
 	if (ret)
 		goto out;
 
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 481aa8dc79f4..b3ef9504fd8b 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1630,7 +1630,7 @@  int wait_on_node_pages_writeback(struct f2fs_sb_info *sbi, nid_t ino)
 {
 	pgoff_t index = 0, end = ULONG_MAX;
 	struct pagevec pvec;
-	int ret2, ret = 0;
+	int ret = 0;
 
 	pagevec_init(&pvec, 0);
 
@@ -1658,10 +1658,6 @@  int wait_on_node_pages_writeback(struct f2fs_sb_info *sbi, nid_t ino)
 		pagevec_release(&pvec);
 		cond_resched();
 	}
-
-	ret2 = filemap_check_errors(NODE_MAPPING(sbi));
-	if (!ret)
-		ret = ret2;
 	return ret;
 }
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 07d0efcb050c..e1ced9cfb090 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -398,6 +398,7 @@  static int fuse_flush(struct file *file, fl_owner_t id)
 	struct fuse_req *req;
 	struct fuse_flush_in inarg;
 	int err;
+	errseq_t since = READ_ONCE(file->f_wb_err);
 
 	if (is_bad_inode(inode))
 		return -EIO;
@@ -413,7 +414,7 @@  static int fuse_flush(struct file *file, fl_owner_t id)
 	fuse_sync_writes(inode);
 	inode_unlock(inode);
 
-	err = filemap_check_errors(file->f_mapping);
+	err = filemap_check_wb_error(file->f_mapping, since);
 	if (err)
 		return err;
 
@@ -446,6 +447,7 @@  int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 	FUSE_ARGS(args);
 	struct fuse_fsync_in inarg;
 	int err;
+	errseq_t since;
 
 	if (is_bad_inode(inode))
 		return -EIO;
@@ -461,6 +463,7 @@  int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 	if (err)
 		goto out;
 
+	since = READ_ONCE(file->f_wb_err);
 	fuse_sync_writes(inode);
 
 	/*
@@ -468,7 +471,7 @@  int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 	 * filemap_write_and_wait_range() does not catch errors.
 	 * We have to do this directly after fuse_sync_writes()
 	 */
-	err = filemap_check_errors(file->f_mapping);
+	err = filemap_check_wb_error(file->f_mapping, since);
 	if (err)
 		goto out;
 
diff --git a/fs/libfs.c b/fs/libfs.c
index efd23040ab25..23319d74fa42 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -991,8 +991,10 @@  int __generic_file_fsync(struct file *file, loff_t start, loff_t end,
 
 out:
 	inode_unlock(inode);
-	err = filemap_check_errors(inode->i_mapping);
-	return ret ? ret : err;
+	if (!ret)
+		ret = filemap_check_wb_error(inode->i_mapping,
+						READ_ONCE(file->f_wb_err));
+	return ret;
 }
 EXPORT_SYMBOL(__generic_file_fsync);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bab3333e8671..e3068f3f69be 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1739,12 +1739,6 @@  static inline int call_mmap(struct file *file, struct vm_area_struct *vma)
 	return file->f_op->mmap(file, vma);
 }
 
-static inline int call_fsync(struct file *file, loff_t start, loff_t end,
-			     int datasync)
-{
-	return file->f_op->fsync(file, start, end, datasync);
-}
-
 ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 			      unsigned long nr_segs, unsigned long fast_segs,
 			      struct iovec *fast_pointer,
@@ -2512,6 +2506,8 @@  extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
 extern int filemap_fdatawait(struct address_space *);
 extern void filemap_fdatawait_keep_errors(struct address_space *);
+extern int filemap_fdatawait_range_since(struct address_space *, loff_t lstart,
+				   loff_t lend, errseq_t since);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
 				   loff_t lend);
 extern int filemap_write_and_wait(struct address_space *mapping);
@@ -2521,7 +2517,6 @@  extern int __filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end, int sync_mode);
 extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
-extern int filemap_check_errors(struct address_space *mapping);
 extern int __must_check filemap_report_wb_error(struct file *file);
 
 /**
@@ -2544,6 +2539,16 @@  static inline errseq_t filemap_sample_wb_error(struct address_space *mapping)
 	return errseq_sample(&mapping->wb_err);
 }
 
+static inline int call_fsync(struct file *file, loff_t start, loff_t end,
+			     int datasync)
+{
+	int ret, ret2;
+
+	ret = file->f_op->fsync(file, start, end, datasync);
+	ret2 = filemap_report_wb_error(file);
+	return ret ? ret : ret2;
+}
+
 extern int vfs_fsync_range(struct file *file, loff_t start, loff_t end,
 			   int datasync);
 extern int vfs_fsync(struct file *file, int datasync);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 84943e8057ef..32512ffc15fa 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -14,6 +14,7 @@ 
 #include <linux/bitops.h>
 #include <linux/hardirq.h> /* for in_interrupt() */
 #include <linux/hugetlb_inline.h>
+#include <linux/errseq.h>
 
 /*
  * Bits in mapping->flags.
@@ -30,12 +31,7 @@  enum mapping_flags {
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
-	if (unlikely(error)) {
-		if (error == -ENOSPC)
-			set_bit(AS_ENOSPC, &mapping->flags);
-		else
-			set_bit(AS_EIO, &mapping->flags);
-	}
+	return errseq_set(&mapping->wb_err, error);
 }
 
 static inline void mapping_set_unevictable(struct address_space *mapping)
diff --git a/mm/filemap.c b/mm/filemap.c
index ee1a798acfc1..eaec849ec8e5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -36,6 +36,7 @@ 
 #include <linux/memcontrol.h>
 #include <linux/cleancache.h>
 #include <linux/rmap.h>
+#include <linux/errseq.h>
 #include "internal.h"
 
 #define CREATE_TRACE_POINTS
@@ -295,20 +296,6 @@  void delete_from_page_cache(struct page *page)
 }
 EXPORT_SYMBOL(delete_from_page_cache);
 
-int filemap_check_errors(struct address_space *mapping)
-{
-	int ret = 0;
-	/* Check for outstanding write errors */
-	if (test_bit(AS_ENOSPC, &mapping->flags) &&
-	    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
-		ret = -ENOSPC;
-	if (test_bit(AS_EIO, &mapping->flags) &&
-	    test_and_clear_bit(AS_EIO, &mapping->flags))
-		ret = -EIO;
-	return ret;
-}
-EXPORT_SYMBOL(filemap_check_errors);
-
 /**
  * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
  * @mapping:	address space structure to write
@@ -418,27 +405,32 @@  static int __filemap_fdatawait_range(struct address_space *mapping,
  * @mapping:		address space structure to wait for
  * @start_byte:		offset in bytes where the range starts
  * @end_byte:		offset in bytes where the range ends (inclusive)
+ * @since:		check for errors since this errseq_t
  *
  * Walk the list of under-writeback pages of the given address space
  * in the given range and wait for all of them.  Check error status of
- * the address space and return it.
- *
- * Since the error status of the address space is cleared by this function,
- * callers are responsible for checking the return value and handling and/or
- * reporting the error.
+ * the address space vs. the since value and return it.
  */
-int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
-			    loff_t end_byte)
+int filemap_fdatawait_range_since(struct address_space *mapping,
+		loff_t start_byte, loff_t end_byte, errseq_t since)
 {
 	int ret, ret2;
 
 	ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
-	ret2 = filemap_check_errors(mapping);
+	ret2 = filemap_check_wb_error(mapping, since);
 	if (!ret)
 		ret = ret2;
 
 	return ret;
 }
+EXPORT_SYMBOL(filemap_fdatawait_range_since);
+
+int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
+			    loff_t end_byte)
+{
+	return filemap_fdatawait_range_since(mapping, start_byte, end_byte,
+			filemap_sample_wb_error(mapping));
+}
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
 /**
@@ -489,6 +481,7 @@  EXPORT_SYMBOL(filemap_fdatawait);
 int filemap_write_and_wait(struct address_space *mapping)
 {
 	int err = 0;
+	errseq_t since = filemap_sample_wb_error(mapping);
 
 	if ((!dax_mapping(mapping) && mapping->nrpages) ||
 	    (dax_mapping(mapping) && mapping->nrexceptional)) {
@@ -500,12 +493,12 @@  int filemap_write_and_wait(struct address_space *mapping)
 		 * thing (e.g. bug) happened, so we avoid waiting for it.
 		 */
 		if (err != -EIO) {
-			int err2 = filemap_fdatawait(mapping);
+			filemap_fdatawait_keep_errors(mapping);
 			if (!err)
-				err = err2;
+				err = filemap_check_wb_error(mapping, since);
 		}
 	} else {
-		err = filemap_check_errors(mapping);
+		err = filemap_check_wb_error(mapping, since);
 	}
 	return err;
 }
@@ -526,6 +519,7 @@  int filemap_write_and_wait_range(struct address_space *mapping,
 				 loff_t lstart, loff_t lend)
 {
 	int err = 0;
+	errseq_t since = filemap_sample_wb_error(mapping);
 
 	if ((!dax_mapping(mapping) && mapping->nrpages) ||
 	    (dax_mapping(mapping) && mapping->nrexceptional)) {
@@ -533,13 +527,12 @@  int filemap_write_and_wait_range(struct address_space *mapping,
 						 WB_SYNC_ALL);
 		/* See comment of filemap_write_and_wait() */
 		if (err != -EIO) {
-			int err2 = filemap_fdatawait_range(mapping,
-						lstart, lend);
+			__filemap_fdatawait_range(mapping, lstart, lend);
 			if (!err)
-				err = err2;
+				err = filemap_check_wb_error(mapping, since);
 		}
 	} else {
-		err = filemap_check_errors(mapping);
+		err = filemap_check_wb_error(mapping, since);
 	}
 	return err;
 }