diff mbox

[v2,08/17] fs: retrofit old error reporting API onto new infrastructure

Message ID 20170412120614.6111-9-jlayton@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jeff Layton April 12, 2017, 12:06 p.m. UTC
Now that we have a better way to store and report errors that occur
during writeback, we need to convert the existing codebase to use it. We
could just adapt all of the filesystem code and related infrastructure
to the new API, but that's a lot of churn.

When it comes to setting errors in the mapping, filemap_set_wb_error is
a drop-in replacement for mapping_set_error. Turn that function into a
simple wrapper around the new one.

Because we want to ensure that writeback errors are always reported at
fsync time, inject filemap_report_wb_error calls much closer to the
syscall boundary. For fsync calls (and things like the nfsd equivalent),
we either return the error that the fsync operation returns, or the one
returned by filemap_report_wb_error. In both cases, we advance the
file->f_wb_err to the latest value.

The final piece of the puzzle is what to do about filemap_check_errors
calls that are being called directly or via filemap_* functions. Here,
we must take a little "creative license".

Since we now handle advancing the file->f_wb_err value at the generic
filesystem layer, we no longer need those callers to do anything like
that, or clear errors out of the mapping. A lot of the existing codebase
relies on being getting an error back from those functions when there is
a writeback problem, so we do still want to have them report writeback
errors somehow.

When reporting writeback errors, we will always report errors that have
occurred since a particular point in time. With the old writeback error
reporting, the time we used was "since it was last tested/cleared" which
is entirely arbitrary and potentially racy. Now, we can at least report
the latest error that has occurred since an arbitrary point in time
(represented as a sampled wb_err_t value).

In the case where we don't have a struct file to work with, this patch
just has the wrappers sample the mapping->wb_err value, and use that as
an arbitrary point from which to check for errors.

That's probably not ideal, particularly in the case of something like
filemap_fdatawait, but I'm not sure it's any worse than what we
already have, and this gives us a basis from which to work.

A lot of those callers will likely want to change to a model where they
sample a wb_err_t much earlier (perhaps when starting a transaction) and
then use that when checking to see if an error occurred later. I'm open
to adding new functions to the API to enable that in later patches.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 Documentation/filesystems/vfs.txt |  9 ++++-----
 fs/btrfs/file.c                   | 10 ++--------
 fs/btrfs/tree-log.c               |  9 ++-------
 fs/ext2/file.c                    |  3 ++-
 fs/f2fs/file.c                    |  3 +++
 fs/f2fs/node.c                    |  6 +-----
 fs/fuse/file.c                    |  7 +++++--
 fs/sync.c                         |  5 ++++-
 include/linux/fs.h                |  1 -
 include/linux/pagemap.h           |  7 +------
 ipc/shm.c                         |  5 ++++-
 mm/filemap.c                      | 32 ++++++++++----------------------
 12 files changed, 38 insertions(+), 59 deletions(-)

Comments

NeilBrown April 12, 2017, 10:14 p.m. UTC | #1
On Wed, Apr 12 2017, Jeff Layton wrote:

> Now that we have a better way to store and report errors that occur
> during writeback, we need to convert the existing codebase to use it. We
> could just adapt all of the filesystem code and related infrastructure
> to the new API, but that's a lot of churn.
>
> When it comes to setting errors in the mapping, filemap_set_wb_error is
> a drop-in replacement for mapping_set_error. Turn that function into a
> simple wrapper around the new one.
>
> Because we want to ensure that writeback errors are always reported at
> fsync time, inject filemap_report_wb_error calls much closer to the
> syscall boundary. For fsync calls (and things like the nfsd equivalent),
> we either return the error that the fsync operation returns, or the one
> returned by filemap_report_wb_error. In both cases, we advance the
> file->f_wb_err to the latest value.
>
> The final piece of the puzzle is what to do about filemap_check_errors
> calls that are being called directly or via filemap_* functions. Here,
> we must take a little "creative license".
>
> Since we now handle advancing the file->f_wb_err value at the generic
> filesystem layer, we no longer need those callers to do anything like
> that, or clear errors out of the mapping. A lot of the existing codebase
> relies on being getting an error back from those functions when there is
> a writeback problem, so we do still want to have them report writeback
> errors somehow.
>
> When reporting writeback errors, we will always report errors that have
> occurred since a particular point in time. With the old writeback error
> reporting, the time we used was "since it was last tested/cleared" which
> is entirely arbitrary and potentially racy. Now, we can at least report
> the latest error that has occurred since an arbitrary point in time
> (represented as a sampled wb_err_t value).
>
> In the case where we don't have a struct file to work with, this patch
> just has the wrappers sample the mapping->wb_err value, and use that as
> an arbitrary point from which to check for errors.
>
> That's probably not ideal, particularly in the case of something like
> filemap_fdatawait, but I'm not sure it's any worse than what we
> already have, and this gives us a basis from which to work.

This aspect of the patch looked rather odd -- sampling a wb_err_t at
some arbitrary time, and then comparing it later.  So that for going to
the trouble of explaining it.

I suspect that the filemap_check_wb_error() will need to be moved
into some parent of the current call site, which is essentially what you
suggest below.  It would be nice if we could do that first, rather than
having the current rather odd code.  But maybe this way is an easier
transition.  It isn't obviously wrong, it just isn't obviously right
either.

>  int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
>  {
> +	int ret, ret2;
>  	struct inode *inode = file->f_mapping->host;
>  
>  	if (!file->f_op->fsync)
> @@ -192,7 +193,9 @@ int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
>  		spin_unlock(&inode->i_lock);
>  		mark_inode_dirty_sync(inode);
>  	}
> -	return call_fsync(file, start, end, datasync);
> +	ret = call_fsync(file, start, end, datasync);
> +	ret2 = filemap_report_wb_error(file);
> +	return ret ? : ret2;
>  }
>  EXPORT_SYMBOL(vfs_fsync_range);
>  
....

>  static int shm_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>  {
> +	int ret, ret2;
>  	struct shm_file_data *sfd = shm_file_data(file);
>  
>  	if (!sfd->file->f_op->fsync)
>  		return -EINVAL;
> -	return call_fsync(sfd->file, start, end, datasync);
> +	ret = call_fsync(sfd->file, start, end, datasync);
> +	ret2 = filemap_report_wb_error(file);
> +	return ret ? : ret2;
>  }

These are the only two places that call_fsync() is called.
Did you consider moving the filemap_report_wb_error() call into
call_fsync() ??

Thanks,
NeilBrown
Jeff Layton April 12, 2017, 10:41 p.m. UTC | #2
On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
> On Wed, Apr 12 2017, Jeff Layton wrote:
> 
> > Now that we have a better way to store and report errors that occur
> > during writeback, we need to convert the existing codebase to use it. We
> > could just adapt all of the filesystem code and related infrastructure
> > to the new API, but that's a lot of churn.
> > 
> > When it comes to setting errors in the mapping, filemap_set_wb_error is
> > a drop-in replacement for mapping_set_error. Turn that function into a
> > simple wrapper around the new one.
> > 
> > Because we want to ensure that writeback errors are always reported at
> > fsync time, inject filemap_report_wb_error calls much closer to the
> > syscall boundary. For fsync calls (and things like the nfsd equivalent),
> > we either return the error that the fsync operation returns, or the one
> > returned by filemap_report_wb_error. In both cases, we advance the
> > file->f_wb_err to the latest value.
> > 
> > The final piece of the puzzle is what to do about filemap_check_errors
> > calls that are being called directly or via filemap_* functions. Here,
> > we must take a little "creative license".
> > 
> > Since we now handle advancing the file->f_wb_err value at the generic
> > filesystem layer, we no longer need those callers to do anything like
> > that, or clear errors out of the mapping. A lot of the existing codebase
> > relies on being getting an error back from those functions when there is
> > a writeback problem, so we do still want to have them report writeback
> > errors somehow.
> > 
> > When reporting writeback errors, we will always report errors that have
> > occurred since a particular point in time. With the old writeback error
> > reporting, the time we used was "since it was last tested/cleared" which
> > is entirely arbitrary and potentially racy. Now, we can at least report
> > the latest error that has occurred since an arbitrary point in time
> > (represented as a sampled wb_err_t value).
> > 
> > In the case where we don't have a struct file to work with, this patch
> > just has the wrappers sample the mapping->wb_err value, and use that as
> > an arbitrary point from which to check for errors.
> > 
> > That's probably not ideal, particularly in the case of something like
> > filemap_fdatawait, but I'm not sure it's any worse than what we
> > already have, and this gives us a basis from which to work.
> 
> This aspect of the patch looked rather odd -- sampling a wb_err_t at
> some arbitrary time, and then comparing it later.  So that for going to
> the trouble of explaining it.
> 
> I suspect that the filemap_check_wb_error() will need to be moved
> into some parent of the current call site, which is essentially what you
> suggest below.  It would be nice if we could do that first, rather than
> having the current rather odd code.  But maybe this way is an easier
> transition.  It isn't obviously wrong, it just isn't obviously right
> either.
> 

Yeah. It's just such a daunting task to have to change so much of the
existing code. I'm looking for ways to make this simpler.

I think it probably is reasonable for filemap_write_and_wait* to just
sample it as early as possible in those functions. filemap_fdatawait is
the real questionable one, as you may have already had some writebacks
complete with errors.

In any case, my thinking was that the old code is not obviously correct
either, so while this shortens the "error capture window" on these
calls, it seems like a reasonable place to start improving things.

I'm happy to with the filesystem implementors to pick more sensible
places to grab the wb_err_t, and/or add APIs that help enable doing
this correctly.

> >  int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> >  {
> > +	int ret, ret2;
> >  	struct inode *inode = file->f_mapping->host;
> >  
> >  	if (!file->f_op->fsync)
> > @@ -192,7 +193,9 @@ int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> >  		spin_unlock(&inode->i_lock);
> >  		mark_inode_dirty_sync(inode);
> >  	}
> > -	return call_fsync(file, start, end, datasync);
> > +	ret = call_fsync(file, start, end, datasync);
> > +	ret2 = filemap_report_wb_error(file);
> > +	return ret ? : ret2;
> >  }
> >  EXPORT_SYMBOL(vfs_fsync_range);
> >  
> 
> ....
> 
> >  static int shm_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> >  {
> > +	int ret, ret2;
> >  	struct shm_file_data *sfd = shm_file_data(file);
> >  
> >  	if (!sfd->file->f_op->fsync)
> >  		return -EINVAL;
> > -	return call_fsync(sfd->file, start, end, datasync);
> > +	ret = call_fsync(sfd->file, start, end, datasync);
> > +	ret2 = filemap_report_wb_error(file);
> > +	return ret ? : ret2;
> >  }
> 
> These are the only two places that call_fsync() is called.
> Did you consider moving the filemap_report_wb_error() call into
> call_fsync() ??
> 

I did, and then I thought "do I really want that part in the static
inline?" Of course, that's effectively the same thing, so I'll move it
into call_fsync in the next patch.
Jeff Layton April 17, 2017, 3:17 p.m. UTC | #3
On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
> On Wed, Apr 12 2017, Jeff Layton wrote:
> 
> > Now that we have a better way to store and report errors that occur
> > during writeback, we need to convert the existing codebase to use it. We
> > could just adapt all of the filesystem code and related infrastructure
> > to the new API, but that's a lot of churn.
> > 
> > When it comes to setting errors in the mapping, filemap_set_wb_error is
> > a drop-in replacement for mapping_set_error. Turn that function into a
> > simple wrapper around the new one.
> > 
> > Because we want to ensure that writeback errors are always reported at
> > fsync time, inject filemap_report_wb_error calls much closer to the
> > syscall boundary. For fsync calls (and things like the nfsd equivalent),
> > we either return the error that the fsync operation returns, or the one
> > returned by filemap_report_wb_error. In both cases, we advance the
> > file->f_wb_err to the latest value.
> > 
> > The final piece of the puzzle is what to do about filemap_check_errors
> > calls that are being called directly or via filemap_* functions. Here,
> > we must take a little "creative license".
> > 
> > Since we now handle advancing the file->f_wb_err value at the generic
> > filesystem layer, we no longer need those callers to do anything like
> > that, or clear errors out of the mapping. A lot of the existing codebase
> > relies on being getting an error back from those functions when there is
> > a writeback problem, so we do still want to have them report writeback
> > errors somehow.
> > 
> > When reporting writeback errors, we will always report errors that have
> > occurred since a particular point in time. With the old writeback error
> > reporting, the time we used was "since it was last tested/cleared" which
> > is entirely arbitrary and potentially racy. Now, we can at least report
> > the latest error that has occurred since an arbitrary point in time
> > (represented as a sampled wb_err_t value).
> > 
> > In the case where we don't have a struct file to work with, this patch
> > just has the wrappers sample the mapping->wb_err value, and use that as
> > an arbitrary point from which to check for errors.
> > 
> > That's probably not ideal, particularly in the case of something like
> > filemap_fdatawait, but I'm not sure it's any worse than what we
> > already have, and this gives us a basis from which to work.
> 
> This aspect of the patch looked rather odd -- sampling a wb_err_t at
> some arbitrary time, and then comparing it later.  So that for going to
> the trouble of explaining it.
> 
> I suspect that the filemap_check_wb_error() will need to be moved
> into some parent of the current call site, which is essentially what you
> suggest below.  It would be nice if we could do that first, rather than
> having the current rather odd code.  But maybe this way is an easier
> transition.  It isn't obviously wrong, it just isn't obviously right
> either.
> 

I thought about this a bit over the weekend, and I think we have a
couple of other options here when we don't have a struct file to work
with.

1) we could pass in a zeroed out value for "since" when we don't have
struct file. The way the implementation works, that tells you whether
there has ever been a writeback error since the inode was first
instantiated. The downside there is that we don't have a way to clear
this out until the inode is evicted, so that's a clear behavior change
from how AS_* flags work today.

2) we could store a second wb_err_t value in the mapping. That one
would basically act as the wb_err_t "cursor" for these cases. We have
to do something like filemap_report_wb_error but would swap the value
into the mapping's cursor instead of dealing with struct file. That's
also not perfectly like what we have with AS_EIO/AS_ENOSPC flags, but
is probably close enough not to matter. Doing that with atomics may be
tricky though.

Alternately, we could just try to plumb in reasonable "since" values
for all of the callers. Some of those won't be too hard to do (and some
don't even really need the errors at all), but some I really have no
clue on.

Option #2 above gives us a stopgap way to move those callers over
without having to have them worry about the details as much.

> >  int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> >  {
> > +	int ret, ret2;
> >  	struct inode *inode = file->f_mapping->host;
> >  
> >  	if (!file->f_op->fsync)
> > @@ -192,7 +193,9 @@ int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
> >  		spin_unlock(&inode->i_lock);
> >  		mark_inode_dirty_sync(inode);
> >  	}
> > -	return call_fsync(file, start, end, datasync);
> > +	ret = call_fsync(file, start, end, datasync);
> > +	ret2 = filemap_report_wb_error(file);
> > +	return ret ? : ret2;
> >  }
> >  EXPORT_SYMBOL(vfs_fsync_range);
> >  
> 
> ....
> 
> >  static int shm_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> >  {
> > +	int ret, ret2;
> >  	struct shm_file_data *sfd = shm_file_data(file);
> >  
> >  	if (!sfd->file->f_op->fsync)
> >  		return -EINVAL;
> > -	return call_fsync(sfd->file, start, end, datasync);
> > +	ret = call_fsync(sfd->file, start, end, datasync);
> > +	ret2 = filemap_report_wb_error(file);
> > +	return ret ? : ret2;
> >  }
> 
> These are the only two places that call_fsync() is called.
> Did you consider moving the filemap_report_wb_error() call into
> call_fsync() ??
> 
> Thanks,
> NeilBrown
NeilBrown April 17, 2017, 10:56 p.m. UTC | #4
On Wed, Apr 12 2017, Jeff Layton wrote:

> On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
>> 
>> I suspect that the filemap_check_wb_error() will need to be moved
>> into some parent of the current call site, which is essentially what you
>> suggest below.  It would be nice if we could do that first, rather than
>> having the current rather odd code.  But maybe this way is an easier
>> transition.  It isn't obviously wrong, it just isn't obviously right
>> either.
>> 
>
> Yeah. It's just such a daunting task to have to change so much of the
> existing code. I'm looking for ways to make this simpler.
>
> I think it probably is reasonable for filemap_write_and_wait* to just
> sample it as early as possible in those functions. filemap_fdatawait is
> the real questionable one, as you may have already had some writebacks
> complete with errors.
>
> In any case, my thinking was that the old code is not obviously correct
> either, so while this shortens the "error capture window" on these
> calls, it seems like a reasonable place to start improving things.

I agree.  It wouldn't hurt to add a note to this effect in the patch
comment so that people understand that the code isn't seen to be
"correct" but only "no worse" with clear direction on what sort of
improvement might be appropriate.

Thanks,
NeilBrown
Jeff Layton April 21, 2017, 12:46 p.m. UTC | #5
On Tue, 2017-04-18 at 08:56 +1000, NeilBrown wrote:
> On Wed, Apr 12 2017, Jeff Layton wrote:
> 
> > On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
> > > 
> > > I suspect that the filemap_check_wb_error() will need to be moved
> > > into some parent of the current call site, which is essentially what you
> > > suggest below.  It would be nice if we could do that first, rather than
> > > having the current rather odd code.  But maybe this way is an easier
> > > transition.  It isn't obviously wrong, it just isn't obviously right
> > > either.
> > > 
> > 
> > Yeah. It's just such a daunting task to have to change so much of the
> > existing code. I'm looking for ways to make this simpler.
> > 
> > I think it probably is reasonable for filemap_write_and_wait* to just
> > sample it as early as possible in those functions. filemap_fdatawait is
> > the real questionable one, as you may have already had some writebacks
> > complete with errors.
> > 
> > In any case, my thinking was that the old code is not obviously correct
> > either, so while this shortens the "error capture window" on these
> > calls, it seems like a reasonable place to start improving things.
> 
> I agree.  It wouldn't hurt to add a note to this effect in the patch
> comment so that people understand that the code isn't seen to be
> "correct" but only "no worse" with clear direction on what sort of
> improvement might be appropriate.
> 

I've got a cleaned-up set that is getting close to ready for
reposting. Before I do though, I think there is another option here
that's worth discussing.

We could store a second wb_err_t (aka errseq_t in the new set) in the
mapping that would would basically act as a "cursor" for these cases.
filemap_check_errors would need to do something like 
filemap_report_wb_error, but it would swap the value into the mapping's
cursor instead of dealing with the one in struct file.

I don't really like adding yet another field here, but the struct
address_space definition has this:

    __attribute__((aligned(sizeof(long))));

Adding the wb_err field means that we end up growing the struct by 8
bytes on x86_64 anyway. Adding another 4 bytes would just consume the
pad, so it wouldn't cost anything there. YMMV on other arches of
course.

That's also not perfectly like what we have with AS_EIO/AS_ENOSPC
flags, but is probably close enough not to matter.

So...this would let us limp along for even longer with the model of
reporting since last check. I'm not sure that's a good thing though. A
long term goal here is to have kernel code that's dealing with
writeback be more deliberate about the point from which it's checking
errors, and this doesn't help promote that.
NeilBrown April 23, 2017, 10:38 p.m. UTC | #6
On Fri, Apr 21 2017, Jeff Layton wrote:

> On Tue, 2017-04-18 at 08:56 +1000, NeilBrown wrote:
>> On Wed, Apr 12 2017, Jeff Layton wrote:
>> 
>> > On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
>> > > 
>> > > I suspect that the filemap_check_wb_error() will need to be moved
>> > > into some parent of the current call site, which is essentially what you
>> > > suggest below.  It would be nice if we could do that first, rather than
>> > > having the current rather odd code.  But maybe this way is an easier
>> > > transition.  It isn't obviously wrong, it just isn't obviously right
>> > > either.
>> > > 
>> > 
>> > Yeah. It's just such a daunting task to have to change so much of the
>> > existing code. I'm looking for ways to make this simpler.
>> > 
>> > I think it probably is reasonable for filemap_write_and_wait* to just
>> > sample it as early as possible in those functions. filemap_fdatawait is
>> > the real questionable one, as you may have already had some writebacks
>> > complete with errors.
>> > 
>> > In any case, my thinking was that the old code is not obviously correct
>> > either, so while this shortens the "error capture window" on these
>> > calls, it seems like a reasonable place to start improving things.
>> 
>> I agree.  It wouldn't hurt to add a note to this effect in the patch
>> comment so that people understand that the code isn't seen to be
>> "correct" but only "no worse" with clear direction on what sort of
>> improvement might be appropriate.
>> 
>
> I've got a cleaned-up set that is getting close to ready for
> reposting. Before I do though, I think there is another option here
> that's worth discussing.
>
> We could store a second wb_err_t (aka errseq_t in the new set) in the
> mapping that would would basically act as a "cursor" for these cases.
> filemap_check_errors would need to do something like 
> filemap_report_wb_error, but it would swap the value into the mapping's
> cursor instead of dealing with the one in struct file.
>
> I don't really like adding yet another field here, but the struct
> address_space definition has this:
>
>     __attribute__((aligned(sizeof(long))));
>
> Adding the wb_err field means that we end up growing the struct by 8
> bytes on x86_64 anyway. Adding another 4 bytes would just consume the
> pad, so it wouldn't cost anything there. YMMV on other arches of
> course.
>
> That's also not perfectly like what we have with AS_EIO/AS_ENOSPC
> flags, but is probably close enough not to matter.
>
> So...this would let us limp along for even longer with the model of
> reporting since last check. I'm not sure that's a good thing though. A
> long term goal here is to have kernel code that's dealing with
> writeback be more deliberate about the point from which it's checking
> errors, and this doesn't help promote that.

I think this question needs some input from filesystem developers who
might be affected by the answer.

My preference is to not add this field.  I think we would eventually
want to remove it again, and it is easier to ensure it doesn't stay
forever if it is never added.
The version without this field isn't (I think) too bad, but maybe it is
bad enough to motivate fs developers to create a better solution in each
individual case.

If some filesystem developer says they don't like that sort of social
engineering, or objects for any other reason, I will bow to the superior
stake they hold.

NeilBrown
Jeff Layton April 24, 2017, 11:50 a.m. UTC | #7
On Mon, 2017-04-24 at 08:38 +1000, NeilBrown wrote:
> On Fri, Apr 21 2017, Jeff Layton wrote:
> 
> > On Tue, 2017-04-18 at 08:56 +1000, NeilBrown wrote:
> > > On Wed, Apr 12 2017, Jeff Layton wrote:
> > > 
> > > > On Thu, 2017-04-13 at 08:14 +1000, NeilBrown wrote:
> > > > > 
> > > > > I suspect that the filemap_check_wb_error() will need to be moved
> > > > > into some parent of the current call site, which is essentially what you
> > > > > suggest below.  It would be nice if we could do that first, rather than
> > > > > having the current rather odd code.  But maybe this way is an easier
> > > > > transition.  It isn't obviously wrong, it just isn't obviously right
> > > > > either.
> > > > > 
> > > > 
> > > > Yeah. It's just such a daunting task to have to change so much of the
> > > > existing code. I'm looking for ways to make this simpler.
> > > > 
> > > > I think it probably is reasonable for filemap_write_and_wait* to just
> > > > sample it as early as possible in those functions. filemap_fdatawait is
> > > > the real questionable one, as you may have already had some writebacks
> > > > complete with errors.
> > > > 
> > > > In any case, my thinking was that the old code is not obviously correct
> > > > either, so while this shortens the "error capture window" on these
> > > > calls, it seems like a reasonable place to start improving things.
> > > 
> > > I agree.  It wouldn't hurt to add a note to this effect in the patch
> > > comment so that people understand that the code isn't seen to be
> > > "correct" but only "no worse" with clear direction on what sort of
> > > improvement might be appropriate.
> > > 
> > 
> > I've got a cleaned-up set that is getting close to ready for
> > reposting. Before I do though, I think there is another option here
> > that's worth discussing.
> > 
> > We could store a second wb_err_t (aka errseq_t in the new set) in the
> > mapping that would would basically act as a "cursor" for these cases.
> > filemap_check_errors would need to do something like 
> > filemap_report_wb_error, but it would swap the value into the mapping's
> > cursor instead of dealing with the one in struct file.
> > 
> > I don't really like adding yet another field here, but the struct
> > address_space definition has this:
> > 
> >     __attribute__((aligned(sizeof(long))));
> > 
> > Adding the wb_err field means that we end up growing the struct by 8
> > bytes on x86_64 anyway. Adding another 4 bytes would just consume the
> > pad, so it wouldn't cost anything there. YMMV on other arches of
> > course.
> > 
> > That's also not perfectly like what we have with AS_EIO/AS_ENOSPC
> > flags, but is probably close enough not to matter.
> > 
> > So...this would let us limp along for even longer with the model of
> > reporting since last check. I'm not sure that's a good thing though. A
> > long term goal here is to have kernel code that's dealing with
> > writeback be more deliberate about the point from which it's checking
> > errors, and this doesn't help promote that.
> 
> I think this question needs some input from filesystem developers who
> might be affected by the answer.
> 
> My preference is to not add this field.  I think we would eventually
> want to remove it again, and it is easier to ensure it doesn't stay
> forever if it is never added.
> The version without this field isn't (I think) too bad, but maybe it is
> bad enough to motivate fs developers to create a better solution in each
> individual case.
> 
> If some filesystem developer says they don't like that sort of social
> engineering, or objects for any other reason, I will bow to the superior
> stake they hold.
> 
> 

That's pretty much my view too. I just figured I needed to throw the
option out there in the interest of full disclosure.

I think keeping a per-mapping cursor like this does make sense in some
situations though. For instance, there does seem to be quite a bit of
local fs journaling code that goes through the pagecache. For those, I
could see keeping the cursor in some sort of per-journal structure, and
doing a check-and-advance against that in appropriate places.

This is an option we can bring up for folks who do want to continue to
use a similar error tracking model in these situations though.
diff mbox

Patch

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 1e0119f1c46a..493bf052fa33 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -577,7 +577,7 @@  written at any point after PG_Dirty is clear.  Once it is known to be
 safe, PG_Writeback is cleared.
 
 If there is an error during writeback, then the address_space should be
-marked with an error (typically using filemap_set_wb_error), in order to
+marked with an error (typically using mapping_set_error), in order to
 ensure that the error can later be reported to the application when an
 fsync is issued.
 
@@ -895,10 +895,9 @@  otherwise noted.
 
   release: called when the last reference to an open file is closed
 
-  fsync: called by the fsync(2) system call. Filesystems that use the
-	pagecache should call filemap_report_wb_error before returning
-	to ensure that any errors that occurred during writeback are
-	reported and the file's error sequence advanced.
+  fsync: called by the fsync(2) system call. Errors that were previously
+	 recorded using mapping_set_error will automatically be returned to
+	 the application and the file's error sequence advanced.
 
   fasync: called by the fcntl(2) system call when asynchronous
 	(non-blocking) mode is enabled for a file
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 520cb7230b2d..8ae0884350dd 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1962,6 +1962,7 @@  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	int ret = 0;
 	bool full_sync = 0;
 	u64 len;
+	wb_err_t wb_since = READ_ONCE(file->f_wb_err);
 
 	/*
 	 * The range length can be represented by u64, we have to do the typecasts
@@ -2079,14 +2080,7 @@  int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		 */
 		clear_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
 			  &BTRFS_I(inode)->runtime_flags);
-		/*
-		 * An ordered extent might have started before and completed
-		 * already with io errors, in which case the inode was not
-		 * updated and we end up here. So check the inode's mapping
-		 * flags for any errors that might have happened while doing
-		 * writeback of file data.
-		 */
-		ret = filemap_check_errors(inode->i_mapping);
+		ret = filemap_check_wb_error(inode->i_mapping, wb_since);
 		inode_unlock(inode);
 		goto out;
 	}
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index a59674c3e69e..b7f5dabae999 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3972,12 +3972,6 @@  static int wait_ordered_extents(struct btrfs_trans_handle *trans,
 			    test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)));
 
 		if (test_bit(BTRFS_ORDERED_IOERR, &ordered->flags)) {
-			/*
-			 * Clear the AS_EIO/AS_ENOSPC flags from the inode's
-			 * i_mapping flags, so that the next fsync won't get
-			 * an outdated io error too.
-			 */
-			filemap_check_errors(inode->i_mapping);
 			*ordered_io_error = true;
 			break;
 		}
@@ -4171,6 +4165,7 @@  static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 	u64 test_gen;
 	int ret = 0;
 	int num = 0;
+	wb_err_t since = filemap_sample_wb_error(inode->vfs_inode.i_mapping);
 
 	INIT_LIST_HEAD(&extents);
 
@@ -4214,7 +4209,7 @@  static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans,
 	 * without writing to the log tree and the fsync must report the
 	 * file data write error and not commit the current transaction.
 	 */
-	ret = filemap_check_errors(inode->vfs_inode.i_mapping);
+	ret = filemap_check_wb_error(inode->vfs_inode.i_mapping, since);
 	if (ret)
 		ctx->io_err = ret;
 process:
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0ca77d337c94..bfb87969ad9b 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -175,12 +175,13 @@  int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 	int ret;
 	struct super_block *sb = file->f_mapping->host->i_sb;
 	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+	wb_err_t since = READ_ONCE(file->f_wb_err);
 
 	ret = generic_file_fsync(file, start, end, datasync);
 	if (ret == -EIO)
 		ret = -EIO;
 	else
-		ret = filemap_check_errors(mapping);
+		ret = filemap_check_wb_error(mapping, since);
 
 	if (ret) {
 		/* We don't really know where the IO error happened... */
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 5f7317875a67..2ab4a28aa933 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -187,6 +187,7 @@  static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
 		.nr_to_write = LONG_MAX,
 		.for_reclaim = 0,
 	};
+	wb_err_t since = READ_ONCE(file->f_wb_err);
 
 	if (unlikely(f2fs_readonly(inode->i_sb)))
 		return 0;
@@ -265,6 +266,8 @@  static int f2fs_do_sync_file(struct file *file, loff_t start, loff_t end,
 	}
 
 	ret = wait_on_node_pages_writeback(sbi, ino);
+	if (ret == 0)
+		ret = filemap_check_wb_error(NODE_MAPPING(sbi), since);
 	if (ret)
 		goto out;
 
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 481aa8dc79f4..b3ef9504fd8b 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1630,7 +1630,7 @@  int wait_on_node_pages_writeback(struct f2fs_sb_info *sbi, nid_t ino)
 {
 	pgoff_t index = 0, end = ULONG_MAX;
 	struct pagevec pvec;
-	int ret2, ret = 0;
+	int ret = 0;
 
 	pagevec_init(&pvec, 0);
 
@@ -1658,10 +1658,6 @@  int wait_on_node_pages_writeback(struct f2fs_sb_info *sbi, nid_t ino)
 		pagevec_release(&pvec);
 		cond_resched();
 	}
-
-	ret2 = filemap_check_errors(NODE_MAPPING(sbi));
-	if (!ret)
-		ret = ret2;
 	return ret;
 }
 
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ec238fb5a584..7972fdb189bf 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -398,6 +398,7 @@  static int fuse_flush(struct file *file, fl_owner_t id)
 	struct fuse_req *req;
 	struct fuse_flush_in inarg;
 	int err;
+	wb_err_t since = READ_ONCE(file->f_wb_err);
 
 	if (is_bad_inode(inode))
 		return -EIO;
@@ -413,7 +414,7 @@  static int fuse_flush(struct file *file, fl_owner_t id)
 	fuse_sync_writes(inode);
 	inode_unlock(inode);
 
-	err = filemap_check_errors(file->f_mapping);
+	err = filemap_check_wb_error(file->f_mapping, since);
 	if (err)
 		return err;
 
@@ -446,6 +447,7 @@  int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 	FUSE_ARGS(args);
 	struct fuse_fsync_in inarg;
 	int err;
+	wb_err_t since;
 
 	if (is_bad_inode(inode))
 		return -EIO;
@@ -461,6 +463,7 @@  int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 	if (err)
 		goto out;
 
+	since = READ_ONCE(file->f_wb_err);
 	fuse_sync_writes(inode);
 
 	/*
@@ -468,7 +471,7 @@  int fuse_fsync_common(struct file *file, loff_t start, loff_t end,
 	 * filemap_write_and_wait_range() does not catch errors.
 	 * We have to do this directly after fuse_sync_writes()
 	 */
-	err = filemap_check_errors(file->f_mapping);
+	err = filemap_check_wb_error(file->f_mapping, since);
 	if (err)
 		goto out;
 
diff --git a/fs/sync.c b/fs/sync.c
index 11ba023434b1..cfde7a2b88ec 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -182,6 +182,7 @@  SYSCALL_DEFINE1(syncfs, int, fd)
  */
 int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
 {
+	int ret, ret2;
 	struct inode *inode = file->f_mapping->host;
 
 	if (!file->f_op->fsync)
@@ -192,7 +193,9 @@  int vfs_fsync_range(struct file *file, loff_t start, loff_t end, int datasync)
 		spin_unlock(&inode->i_lock);
 		mark_inode_dirty_sync(inode);
 	}
-	return call_fsync(file, start, end, datasync);
+	ret = call_fsync(file, start, end, datasync);
+	ret2 = filemap_report_wb_error(file);
+	return ret ? : ret2;
 }
 EXPORT_SYMBOL(vfs_fsync_range);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bcc791d43c6e..bd2cfee92b4f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2532,7 +2532,6 @@  extern int __filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end, int sync_mode);
 extern int filemap_fdatawrite_range(struct address_space *mapping,
 				loff_t start, loff_t end);
-extern int filemap_check_errors(struct address_space *mapping);
 extern void __filemap_set_wb_error(struct address_space *mapping, int err);
 extern wb_err_t filemap_sample_wb_error(struct address_space *mapping);
 extern int filemap_report_wb_error(struct file *file);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 84943e8057ef..458caa916244 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -30,12 +30,7 @@  enum mapping_flags {
 
 static inline void mapping_set_error(struct address_space *mapping, int error)
 {
-	if (unlikely(error)) {
-		if (error == -ENOSPC)
-			set_bit(AS_ENOSPC, &mapping->flags);
-		else
-			set_bit(AS_EIO, &mapping->flags);
-	}
+	return filemap_set_wb_error(mapping, error);
 }
 
 static inline void mapping_set_unevictable(struct address_space *mapping)
diff --git a/ipc/shm.c b/ipc/shm.c
index 481d2a9c298a..5d8df57cfc76 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -448,11 +448,14 @@  static int shm_release(struct inode *ino, struct file *file)
 
 static int shm_fsync(struct file *file, loff_t start, loff_t end, int datasync)
 {
+	int ret, ret2;
 	struct shm_file_data *sfd = shm_file_data(file);
 
 	if (!sfd->file->f_op->fsync)
 		return -EINVAL;
-	return call_fsync(sfd->file, start, end, datasync);
+	ret = call_fsync(sfd->file, start, end, datasync);
+	ret2 = filemap_report_wb_error(file);
+	return ret ? : ret2;
 }
 
 static long shm_fallocate(struct file *file, int mode, loff_t offset,
diff --git a/mm/filemap.c b/mm/filemap.c
index 4966e9dea945..525dddc15abb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -295,20 +295,6 @@  void delete_from_page_cache(struct page *page)
 }
 EXPORT_SYMBOL(delete_from_page_cache);
 
-int filemap_check_errors(struct address_space *mapping)
-{
-	int ret = 0;
-	/* Check for outstanding write errors */
-	if (test_bit(AS_ENOSPC, &mapping->flags) &&
-	    test_and_clear_bit(AS_ENOSPC, &mapping->flags))
-		ret = -ENOSPC;
-	if (test_bit(AS_EIO, &mapping->flags) &&
-	    test_and_clear_bit(AS_EIO, &mapping->flags))
-		ret = -EIO;
-	return ret;
-}
-EXPORT_SYMBOL(filemap_check_errors);
-
 /**
  * __filemap_fdatawrite_range - start writeback on mapping dirty pages in range
  * @mapping:	address space structure to write
@@ -431,9 +417,10 @@  int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 			    loff_t end_byte)
 {
 	int ret, ret2;
+	wb_err_t since = READ_ONCE(mapping->wb_err);
 
 	ret = __filemap_fdatawait_range(mapping, start_byte, end_byte);
-	ret2 = filemap_check_errors(mapping);
+	ret2 = filemap_check_wb_error(mapping, since);
 	if (!ret)
 		ret = ret2;
 
@@ -489,6 +476,7 @@  EXPORT_SYMBOL(filemap_fdatawait);
 int filemap_write_and_wait(struct address_space *mapping)
 {
 	int err = 0;
+	wb_err_t since = READ_ONCE(mapping->wb_err);
 
 	if ((!dax_mapping(mapping) && mapping->nrpages) ||
 	    (dax_mapping(mapping) && mapping->nrexceptional)) {
@@ -500,12 +488,12 @@  int filemap_write_and_wait(struct address_space *mapping)
 		 * thing (e.g. bug) happened, so we avoid waiting for it.
 		 */
 		if (err != -EIO) {
-			int err2 = filemap_fdatawait(mapping);
+			filemap_fdatawait_keep_errors(mapping);
 			if (!err)
-				err = err2;
+				err = filemap_check_wb_error(mapping, since);
 		}
 	} else {
-		err = filemap_check_errors(mapping);
+		err = filemap_check_wb_error(mapping, since);
 	}
 	return err;
 }
@@ -526,6 +514,7 @@  int filemap_write_and_wait_range(struct address_space *mapping,
 				 loff_t lstart, loff_t lend)
 {
 	int err = 0;
+	wb_err_t since = READ_ONCE(mapping->wb_err);
 
 	if ((!dax_mapping(mapping) && mapping->nrpages) ||
 	    (dax_mapping(mapping) && mapping->nrexceptional)) {
@@ -533,13 +522,12 @@  int filemap_write_and_wait_range(struct address_space *mapping,
 						 WB_SYNC_ALL);
 		/* See comment of filemap_write_and_wait() */
 		if (err != -EIO) {
-			int err2 = filemap_fdatawait_range(mapping,
-						lstart, lend);
+			__filemap_fdatawait_range(mapping, lstart, lend);
 			if (!err)
-				err = err2;
+				err = filemap_check_wb_error(mapping, since);
 		}
 	} else {
-		err = filemap_check_errors(mapping);
+		err = filemap_check_wb_error(mapping, since);
 	}
 	return err;
 }