[0/3] mm/fs: get PG_error out of the writeback reporting business

Message ID 1488724854.2925.6.camel@redhat.com

Commit Message

Jeff Layton March 5, 2017, 2:40 p.m. UTC
On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote:
> I recently did some work to wire up -ENOSPC handling in ceph, and found
> I could get back -EIO errors in some cases when I should have instead
> gotten -ENOSPC. The problem was that the ceph writeback code would set
> PG_error on a writeback error, and that error would clobber the mapping
> error.
> 

I should also note that relying on PG_error to report writeback errors
is inherently unreliable as well. If someone calls sync() before your
fsync gets in there, then you'll likely lose the error anyway.

filemap_fdatawait_keep_errors will preserve the error in the mapping,
but not the individual PG_error flags, so I think we want to ensure
that the mapping error is set whenever there is a writeback error, and
not rely on the PG_error bit for that.
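To make the clobbering concrete, here is a small user-space model of the flow (a sketch only; `address_space`, `mapping_set_error()` and `filemap_check_errors()` are simplified stand-ins for the kernel code, and `page_error` stands in for a PG_error bit): the wait path turns a per-page error flag into -EIO, and that -EIO shadows an -ENOSPC already latched in the mapping.

```c
#include <assert.h>
#include <errno.h>

#define AS_EIO    (1 << 0)
#define AS_ENOSPC (1 << 1)

/* Simplified stand-in for struct address_space. */
struct address_space {
	unsigned long flags;	/* AS_EIO / AS_ENOSPC */
	int page_error;		/* stand-in for one page's PG_error bit */
};

/* Latch a writeback error in the mapping, as mapping_set_error() does. */
static void mapping_set_error(struct address_space *m, int error)
{
	if (error) {
		if (error == -ENOSPC)
			m->flags |= AS_ENOSPC;
		else
			m->flags |= AS_EIO;
	}
}

/* Test-and-clear the mapping error bits, as filemap_check_errors() does. */
static int filemap_check_errors(struct address_space *m)
{
	int ret = 0;

	if (m->flags & AS_ENOSPC) {
		m->flags &= ~AS_ENOSPC;
		ret = -ENOSPC;
	}
	if (m->flags & AS_EIO) {
		m->flags &= ~AS_EIO;
		ret = -EIO;
	}
	return ret;
}

/*
 * Model of filemap_fdatawait_range(): a set PG_error becomes -EIO, and
 * because the per-page result takes precedence, it clobbers whatever
 * error (e.g. -ENOSPC) the mapping itself recorded.
 */
static int filemap_fdatawait(struct address_space *m)
{
	int ret = 0, ret2;

	if (m->page_error) {		/* TestClearPageError() */
		m->page_error = 0;
		ret = -EIO;
	}
	ret2 = filemap_check_errors(m);
	if (!ret)
		ret = ret2;
	return ret;
}
```

With this model, a ceph-style writeback failure that sets both the mapping error and PG_error reports -EIO instead of -ENOSPC; dropping the PG_error side (as the series does) lets the mapping error through.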

> While I fixed that problem by simply not setting that bit on errors,
> that led me down a rabbit hole of looking at how PG_error is being
> handled in the kernel.
> 
> This patch series is a few fixes for things that I noticed by
> inspection. I don't have a great way to test these since they involve
> error handling. I can certainly doctor up a kernel to inject errors
> in this code and test by hand, however, if these look plausible up front.
> 
> Jeff Layton (3):
>   nilfs2: set the mapping error when calling SetPageError on writeback
>   mm: don't TestClearPageError in __filemap_fdatawait_range
>   mm: set mapping error when launder_pages fails
> 
>  fs/nilfs2/segment.c |  1 +
>  mm/filemap.c        | 19 ++++---------------
>  mm/truncate.c       |  6 +++++-
>  3 files changed, 10 insertions(+), 16 deletions(-)
> 

(cc'ing Ross...)

Just when I thought that only NILFS2 needed a little work here, I see
another spot...

I think we also need to fix dax_writeback_mapping_range to set a
mapping error on writeback failure. It looks like that's not happening
today. Something like the patch below (obviously untested).

I'll also plan to follow up with a patch to vfs.txt to outline how
writeback errors should be handled by filesystems, assuming that this
patchset isn't completely off base.

-------------------8<-----------------------

[PATCH] dax: set error in mapping when writeback fails

In order to get proper error codes from fsync, we must set an error in
the mapping range when writeback fails.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/dax.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Ross Zwisler March 6, 2017, 11:08 p.m. UTC | #1
On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote:
> [...]
> diff --git a/fs/dax.c b/fs/dax.c
> index c45598b912e1..9005d90deeda 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>  
>  			ret = dax_writeback_one(bdev, mapping, indices[i],
>  					pvec.pages[i]);
> -			if (ret < 0)
> +			if (ret < 0) {
> +				mapping_set_error(mapping, ret);
>  				return ret;
> +			}

(Adding Jan)

I tested this a bit, and for the DAX case at least I don't think this does
what you want.  The current code already returns -EIO if dax_writeback_one()
hits an error, which bubbles up through the call stack and makes the fsync()
call in userspace fail with EIO, as we want.  With both ext4 and xfs this
patch (applied to v4.10) makes it so that we fail the current fsync() due to
the return value of -EIO, then we fail the next fsync() as well because only
then do we actually process the AS_EIO flag inside of filemap_check_errors().

I think maybe the missing piece is that our normal DAX fsync call stack
doesn't include a call to filemap_check_errors() if we return -EIO.  Here's
our stack in xfs:

    dax_writeback_mapping_range+0x32/0x70
    xfs_vm_writepages+0x8c/0xf0
    do_writepages+0x21/0x30
    __filemap_fdatawrite_range+0xc6/0x100
    filemap_write_and_wait_range+0x44/0x90
    xfs_file_fsync+0x7a/0x2c0
    vfs_fsync_range+0x4b/0xb0
    ? trace_hardirqs_on_caller+0xf5/0x1b0
    do_fsync+0x3d/0x70
    SyS_fsync+0x10/0x20
    entry_SYSCALL_64_fastpath+0x1f/0xc2

On the subsequent fsync() call we *do* end up calling filemap_check_errors()
via filemap_fdatawait_range(), which tests & clears the AS_EIO flag in the
mapping:

    filemap_fdatawait_range+0x3b/0x80
    filemap_write_and_wait_range+0x5a/0x90
    xfs_file_fsync+0x7a/0x2c0
    vfs_fsync_range+0x4b/0xb0
    ? trace_hardirqs_on_caller+0xf5/0x1b0
    do_fsync+0x3d/0x70
    SyS_fsync+0x10/0x20
    entry_SYSCALL_64_fastpath+0x1f/0xc2

Was your concern just that you didn't think that fsync() was properly
returning an error when dax_writeback_one() hit an error?  Or is there another
path by which we need to report the error, where it is actually important that
we set AS_EIO?  If it's the latter, then I think we need to rework the fsync
call path so that we both generate and consume AS_EIO on the same call,
probably in filemap_write_and_wait_range().
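The two-fsync behaviour Ross observed can be modeled in user-space C (simplified stand-ins for the kernel names, assuming Jeff's mapping_set_error() change is applied so the write side latches AS_EIO): the -EIO short-circuit in filemap_write_and_wait() skips the wait path, so the flag is only consumed, and re-reported, by the next call.

```c
#include <assert.h>
#include <errno.h>

#define AS_EIO (1 << 0)

/* Simplified stand-in: next_write_err injects one writeback failure. */
struct address_space {
	unsigned long flags;
	int next_write_err;
};

/* Write side: the error is both returned and latched in the mapping,
 * as it would be with mapping_set_error() wired up. */
static int filemap_fdatawrite(struct address_space *m)
{
	int err = m->next_write_err;

	m->next_write_err = 0;
	if (err)
		m->flags |= AS_EIO;
	return err;
}

/* Test-and-clear AS_EIO, as filemap_check_errors() does. */
static int filemap_check_errors(struct address_space *m)
{
	if (m->flags & AS_EIO) {
		m->flags &= ~AS_EIO;
		return -EIO;
	}
	return 0;
}

static int filemap_fdatawait(struct address_space *m)
{
	return filemap_check_errors(m);
}

/*
 * Model of the v4.10 behaviour: on -EIO from the write side we bail
 * out before the wait path, leaving AS_EIO set in the mapping, so it
 * gets reported again on the *next* call.
 */
static int filemap_write_and_wait(struct address_space *m)
{
	int err = filemap_fdatawrite(m);

	if (err == -EIO)
		return err;		/* AS_EIO not consumed here */
	int err2 = filemap_fdatawait(m);
	if (!err)
		err = err2;
	return err;
}
```

One injected failure makes two consecutive calls fail: the first from the return value, the second from the stale AS_EIO flag.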
Jan Kara March 7, 2017, 10:26 a.m. UTC | #2
On Mon 06-03-17 16:08:01, Ross Zwisler wrote:
> [...]
> Was your concern just that you didn't think that fsync() was properly
> returning an error when dax_writeback_one() hit an error?  Or is there another
> path by which we need to report the error, where it is actually important that
> we set AS_EIO?  If it's the latter, then I think we need to rework the fsync
> call path so that we both generate and consume AS_EIO on the same call,
> probably in filemap_write_and_wait_range().

So I believe this is due to the special handling of EIO inside
filemap_write_and_wait(). Normally, filemap_check_errors() happens inside
filemap_fdatawait() there, however not for EIO returned from
filemap_fdatawrite(). In that case we bail out immediately. So I think
Jeff's patch is correct, but we also need to change filemap_write_and_wait()
to call filemap_check_errors() directly on EIO from filemap_fdatawrite().
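A user-space sketch of what Jan is suggesting (simplified stand-ins, not the actual kernel patch): on -EIO from the write side, still call filemap_check_errors(), so the same call both reports the error and consumes AS_EIO.

```c
#include <assert.h>
#include <errno.h>

#define AS_EIO (1 << 0)

/* Simplified stand-in: next_write_err injects one writeback failure. */
struct address_space {
	unsigned long flags;
	int next_write_err;
};

/* Write side: the failure is returned and latched in the mapping. */
static int filemap_fdatawrite(struct address_space *m)
{
	int err = m->next_write_err;

	m->next_write_err = 0;
	if (err)
		m->flags |= AS_EIO;
	return err;
}

/* Test-and-clear AS_EIO, as filemap_check_errors() does. */
static int filemap_check_errors(struct address_space *m)
{
	if (m->flags & AS_EIO) {
		m->flags &= ~AS_EIO;
		return -EIO;
	}
	return 0;
}

static int filemap_fdatawait(struct address_space *m)
{
	return filemap_check_errors(m);
}

/*
 * Suggested shape of the fix: even when the write side returns -EIO
 * and we skip the wait, consume AS_EIO on this call so the error is
 * generated and reported by the same fsync, not leaked into the next.
 */
static int filemap_write_and_wait(struct address_space *m)
{
	int err = filemap_fdatawrite(m);

	if (err != -EIO) {
		int err2 = filemap_fdatawait(m);
		if (!err)
			err = err2;
	} else {
		filemap_check_errors(m);	/* consume AS_EIO now */
	}
	return err;
}
```

With this change, one injected failure fails exactly one call, and the following call succeeds.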

On a more general note (DAX is actually fine here), I find the current
practice of clearing page dirty bits on error and reporting it just once
problematic. It keeps the system running, but data is lost, possibly
without the error getting anywhere it is useful. We get away with this
because it is a rare event, but it seems like problematic behavior.
But this is more a topic for the discussion at LSF.

								Honza
Jeff Layton March 7, 2017, 2:03 p.m. UTC | #3
On Tue, 2017-03-07 at 11:26 +0100, Jan Kara wrote:
> [...]
> 
> So I believe this is due to the special handling of EIO inside
> filemap_write_and_wait(). Normally, filemap_check_errors() happens inside
> filemap_fdatawait() there however not for EIO returned from
> filemap_fdatawrite(). In that case we bail out immediately. So I think
> Jeff's patch is correct but we need to change filemap_write_and_wait() to
> call also filemap_check_errors() directly on EIO from filemap_fdatawrite().
> 

Yes, that makes total sense. I've got a filemap_write_and_wait patch in
my pile now that does this. I'll run what I have through xfstests, see
how it does, and plan to post a v2 set once I do.

> On a more general note (DAX is actually fine here), I find the current
> practice of clearing page dirty bits on error and reporting it just once
> problematic. It keeps the system running but data is lost and possibly
> without getting the error anywhere where it is useful. We get away with
> this because it is a rare event but it seems like a problematic behavior.
> But this is more for the discussion at LSF.
> 

That really is the crux of the matter. Unfortunately, that's sort of how
the POSIX write/fsync model is designed. If we want to change that, then
I think that we have to consider what a new interface for this would
look like. Maybe we can do something there with new sync_file_range
flags?

I think this probably also dovetails with Kevin Wolf's proposed LSF
topic too, so maybe we can discuss all of this together there.
Ross Zwisler March 7, 2017, 3:59 p.m. UTC | #4
On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> [...]
> 
> So I believe this is due to the special handling of EIO inside
> filemap_write_and_wait(). Normally, filemap_check_errors() happens inside

s/filemap_write_and_wait/filemap_write_and_wait_range/ for this particular
case, but we definitely want to make changes that keep them consistent.

> filemap_fdatawait() there however not for EIO returned from
> filemap_fdatawrite(). In that case we bail out immediately. So I think
> Jeff's patch is correct but we need to change filemap_write_and_wait() to
> call also filemap_check_errors() directly on EIO from filemap_fdatawrite().

So I guess my question was: why is it important that we set AS_EIO, if the EIO
is already being reported correctly?  Is it just for consistency with the
buffered fsync case, or is there currently a path where the -EIO from DAX will
be missed, and we actually need AS_EIO to be set in mapping->flags so that we
correctly report an error?

> On a more general note (DAX is actually fine here), I find the current
> practice of clearing page dirty bits on error and reporting it just once
> problematic. It keeps the system running but data is lost and possibly
> without getting the error anywhere where it is useful. We get away with
> this because it is a rare event but it seems like a problematic behavior.
> But this is more for the discussion at LSF.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
Jan Kara March 7, 2017, 4:17 p.m. UTC | #5
On Tue 07-03-17 08:59:16, Ross Zwisler wrote:
> On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > On Mon 06-03-17 16:08:01, Ross Zwisler wrote:
> > > On Sun, Mar 05, 2017 at 09:40:54AM -0500, Jeff Layton wrote:
> > > > On Sun, 2017-03-05 at 08:35 -0500, Jeff Layton wrote:
> > > > > I recently did some work to wire up -ENOSPC handling in ceph, and found
> > > > > I could get back -EIO errors in some cases when I should have instead
> > > > > gotten -ENOSPC. The problem was that the ceph writeback code would set
> > > > > PG_error on a writeback error, and that error would clobber the mapping
> > > > > error.
> > > > > 
> > > > 
> > > > I should also note that relying on PG_error to report writeback errors
> > > > is inherently unreliable as well. If someone calls sync() before your
> > > > fsync gets in there, then you'll likely lose it anyway.
> > > > 
> > > > filemap_fdatawait_keep_errors will preserve the error in the mapping,
> > > > but not the individual PG_error flags, so I think we do want to ensure
> > > > that the mapping error is set when there is a writeback error and not
> > > > rely on PG_error bit for that.
> > > > 
> > > > > While I fixed that problem by simply not setting that bit on errors,
> > > > > that led me down a rabbit hole of looking at how PG_error is being
> > > > > handled in the kernel.
> > > > > 
> > > > > This patch series is a few fixes for things that I 100% noticed by
> > > > > inspection. I don't have a great way to test these since they involve
> > > > > error handling. I can certainly doctor up a kernel to inject errors
> > > > > in this code and test by hand however if these look plausible up front.
> > > > > 
> > > > > Jeff Layton (3):
> > > > >   nilfs2: set the mapping error when calling SetPageError on writeback
> > > > >   mm: don't TestClearPageError in __filemap_fdatawait_range
> > > > >   mm: set mapping error when launder_pages fails
> > > > > 
> > > > >  fs/nilfs2/segment.c |  1 +
> > > > >  mm/filemap.c        | 19 ++++---------------
> > > > >  mm/truncate.c       |  6 +++++-
> > > > >  3 files changed, 10 insertions(+), 16 deletions(-)
> > > > > 
> > > > 
> > > > (cc'ing Ross...)
> > > > 
> > > > Just when I thought that only NILFS2 needed a little work here, I see
> > > > another spot...
> > > > 
> > > > I think that we should also need to fix dax_writeback_mapping_range to
> > > > set a mapping error on writeback as well. It looks like that's not
> > > > happening today. Something like the patch below (obviously untested).
> > > > 
> > > > I'll also plan to follow up with a patch to vfs.txt to outline how
> > > > writeback errors should be handled by filesystems, assuming that this
> > > > patchset isn't completely off base.
> > > > 
> > > > -------------------8<-----------------------
> > > > 
> > > > [PATCH] dax: set error in mapping when writeback fails
> > > > 
> > > > In order to get proper error codes from fsync, we must set an error in
> > > > the mapping range when writeback fails.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@redhat.com>
> > > > ---
> > > >  fs/dax.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > index c45598b912e1..9005d90deeda 100644
> > > > --- a/fs/dax.c
> > > > +++ b/fs/dax.c
> > > > @@ -888,8 +888,10 @@ int dax_writeback_mapping_range(struct address_space *mapping,
> > > >  
> > > >  			ret = dax_writeback_one(bdev, mapping, indices[i],
> > > >  					pvec.pages[i]);
> > > > -			if (ret < 0)
> > > > +			if (ret < 0) {
> > > > +				mapping_set_error(mapping, ret);
> > > >  				return ret;
> > > > +			}
> > > 
> > > (Adding Jan)
> > > 
> > > I tested this a bit, and for the DAX case at least I don't think this does
> > > what you want.  The current code already returns -EIO if dax_writeback_one()
> > > hits an error, which bubbles up through the call stack and makes the fsync()
> > > call in userspace fail with EIO, as we want.  With both ext4 and xfs this
> > > patch (applied to v4.10) makes it so that we fail the current fsync() due to
> > > the return value of -EIO, then we fail the next fsync() as well because only
> > > then do we actually process the AS_EIO flag inside of filemap_check_errors().
> > > 
> > > I think maybe the missing piece is that our normal DAX fsync call stack
> > > doesn't include a call to filemap_check_errors() if we return -EIO.  Here's
> > > our stack in xfs:
> > > 
> > >     dax_writeback_mapping_range+0x32/0x70
> > >     xfs_vm_writepages+0x8c/0xf0
> > >     do_writepages+0x21/0x30
> > >     __filemap_fdatawrite_range+0xc6/0x100
> > >     filemap_write_and_wait_range+0x44/0x90
> > >     xfs_file_fsync+0x7a/0x2c0
> > >     vfs_fsync_range+0x4b/0xb0
> > >     ? trace_hardirqs_on_caller+0xf5/0x1b0
> > >     do_fsync+0x3d/0x70
> > >     SyS_fsync+0x10/0x20
> > >     entry_SYSCALL_64_fastpath+0x1f/0xc2
> > > 
> > > On the subsequent fsync() call we *do* end up calling filemap_check_errors()
> > > via filemap_fdatawait_range(), which tests & clears the AS_EIO flag in the
> > > mapping:
> > > 
> > >     filemap_fdatawait_range+0x3b/0x80
> > >     filemap_write_and_wait_range+0x5a/0x90
> > >     xfs_file_fsync+0x7a/0x2c0
> > >     vfs_fsync_range+0x4b/0xb0
> > >     ? trace_hardirqs_on_caller+0xf5/0x1b0
> > >     do_fsync+0x3d/0x70
> > >     SyS_fsync+0x10/0x20
> > >     entry_SYSCALL_64_fastpath+0x1f/0xc2
> > > 
> > > Was your concern just that you didn't think that fsync() was properly
> > > returning an error when dax_writeback_one() hit an error?  Or is there another
> > > path by which we need to report the error, where it is actually important that
> > > we set AS_EIO?  If it's the latter, then I think we need to rework the fsync
> > > call path so that we both generate and consume AS_EIO on the same call,
> > > probably in filemap_write_and_wait_range().
> > 
> > So I believe this is due to the special handling of EIO inside
> > filemap_write_and_wait(). Normally, filemap_check_errors() happens inside
> 
> s/filemap_write_and_wait/filemap_write_and_wait_range/ for this particular
> case, but we definitely want to make changes that keep them consistent.
> 
> > filemap_fdatawait(); however, that is not the case for EIO returned from
> > filemap_fdatawrite(). In that case we bail out immediately. So I think
> > Jeff's patch is correct but we need to change filemap_write_and_wait() to
> > also call filemap_check_errors() directly on EIO from filemap_fdatawrite().
> 
> So I guess my question was: why is it important that we set AS_EIO, if the EIO
> is already being reported correctly?  Is it just for consistency with the
> buffered fsync case, or is there currently a path where the -EIO from DAX will
> be missed, and we actually need AS_EIO to be set in mapping->flags so that we
> correctly report an error?

It is just for consistency and future-proofing. E.g. if we ever decide to do
some background flushing of CPU caches, the current DAX behavior would
suddenly result in missed reports of EIO errors.
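The latched-flag scheme being discussed can be sketched as a small userspace model (an illustrative sketch only, not kernel code; `struct mapping_model` and the `model_*` names are invented stand-ins for `mapping_set_error()` and `filemap_check_errors()`): the error is recorded in the mapping's flags and survives until some later fsync()-like call consumes it, which is why reporting only via a return value is not enough.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Userspace model of the latched-error scheme. The two bools stand in
 * for the AS_EIO / AS_ENOSPC bits in mapping->flags. */
struct mapping_model {
	bool as_eio;	/* models AS_EIO */
	bool as_enospc;	/* models AS_ENOSPC */
};

/* models mapping_set_error(): latch a writeback error for later */
static void model_set_error(struct mapping_model *m, int err)
{
	if (err == -ENOSPC)
		m->as_enospc = true;
	else if (err < 0)
		m->as_eio = true;
}

/* models filemap_check_errors(): report once, then clear */
static int model_check_errors(struct mapping_model *m)
{
	int ret = 0;

	if (m->as_enospc) {
		m->as_enospc = false;
		ret = -ENOSPC;
	}
	if (m->as_eio) {
		m->as_eio = false;
		ret = -EIO;
	}
	return ret;
}
```

Note the test-and-clear semantics: the second check after an error reports nothing, which is exactly the "reported just once" behavior debated elsewhere in this thread.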

								Honza
Theodore Y. Ts'o March 9, 2017, 2:57 a.m. UTC | #6
On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> On a more general note (DAX is actually fine here), I find the current
> practice of clearing page dirty bits on error and reporting it just once
> problematic. It keeps the system running but data is lost and possibly
> without getting the error anywhere where it is useful. We get away with
> this because it is a rare event but it seems like a problematic behavior.
> But this is more for the discussion at LSF.

I'm actually running into this in the last day or two because some MM
folks at $WORK have been trying to push hard for GFP_NOFS removal in
ext4 (at least when we are holding some mutex/semaphore like
i_data_sem) because otherwise it's possible for the OOM killer to be
unable to kill processes because they are holding on to locks that
ext4 is holding.

I've done some initial investigation, and while it's not that hard to
remove GFP_NOFS from certain parts of the writepages() codepath (which
is where we had been running into problems), a really, REALLY big
problem is that if any_filesystem->writepages() returns ENOMEM, it causes
silent data loss, because the pages are marked clean, and so data
written using buffered writeback goes *poof*.

I confirmed this by creating a test kernel with a simple patch such
that if the ext4 file system is mounted with -o debug, there was a 1
in 16 chance that ext4_writepages will immediately return with ENOMEM
(and printk the inode number, so I knew which inodes had gotten the
ENOMEM treatment).  The result was **NOT** pretty.

What I think we should strongly consider is at the very least, special
case ENOMEM being returned by writepages() during background
writeback, and *not* mark the pages clean, and make sure the inode
stays on the dirty inode list, so we can retry the write later.  This
is especially important since the process that issued the write may
have gone away, so there might not even be a userspace process to
complain to.  By converting certain page allocations (most notably in
ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
release the i_data_sem lock and return an error.  This should allow
the OOM killer to do its dirty deed, and hopefully we can retry
the writepages() for that inode later.

In the case of a data integrity sync being issued by fsync() or
umount(), we could allow ENOMEM to get returned to userspace in that
case as well.  I'm not convinced all userspace code will handle an
ENOMEM correctly or sanely, but at least people will be less likely
to blame file system developers.  :-)
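The redirty-on-ENOMEM proposal above can be modeled in a few lines of userspace C (a hypothetical sketch; `struct inode_model` and the `model_*` names are invented, and only ENOMEM is treated as retryable here): instead of marking the pages clean on a transient failure, the inode stays dirty so background writeback will come back to it.

```c
#include <errno.h>
#include <stdbool.h>

/* Model of the proposal: on a transient error (here just ENOMEM) from
 * ->writepages(), keep the inode on the dirty list and retry later,
 * rather than silently marking the pages clean and losing the data. */
struct inode_model {
	bool dirty;
	int fail_writes;	/* simulate this many ENOMEM failures */
};

/* stand-in for a_ops->writepages() that fails a few times */
static int model_writepages(struct inode_model *inode)
{
	if (inode->fail_writes > 0) {
		inode->fail_writes--;
		return -ENOMEM;
	}
	return 0;
}

/* one pass of background writeback over a dirty inode */
static void model_background_writeback(struct inode_model *inode)
{
	int ret;

	if (!inode->dirty)
		return;
	ret = model_writepages(inode);
	if (ret == -ENOMEM)
		return;		/* transient: stay dirty, retry later */
	inode->dirty = false;	/* success: pages really are clean now */
}
```

With this shape, data written via buffered writeback survives an ENOMEM burst: the pages are written out on a later pass once the allocation can succeed.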

The real problem that's going on here, by the way, is that people are
trying to run programs in insanely tight containers, and then when the
kernel locks up, they blame the mm developers.  But if there is silent
data corruption, they will blame the fs developers instead.  And while
kernel lockups are temporary (all you have to do is let the watchdog
reboot the system :-), silent data corruption is *forever*.  So what
we really need to do is to allow the OOM killer to do its work, and if
job owners are unhappy that their processes are getting OOM killed,
maybe they will be suitably incentivized to pay for more memory in
their containers....

						- Ted

P.S. Michael Hocko, apologies for not getting back to you with your
GFP_NOFS removal patches.  But the possibility of fs malfunctions that
might lead to silent data corruption is why I'm being very cautious,
and I now have rather strong confirmation that this is not just an
irrational concern on my part.  (This is also performance review
season, FAST conference was last week, and Usenix ATC program
committee reviews are due this week.  So apologies for any reply
latency.)
Jan Kara March 9, 2017, 9:04 a.m. UTC | #7
On Wed 08-03-17 21:57:25, Ted Tso wrote:
> On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > On a more general note (DAX is actually fine here), I find the current
> > practice of clearing page dirty bits on error and reporting it just once
> > problematic. It keeps the system running but data is lost and possibly
> > without getting the error anywhere where it is useful. We get away with
> > this because it is a rare event but it seems like a problematic behavior.
> > But this is more for the discussion at LSF.
> 
> I'm actually running into this in the last day or two because some MM
> folks at $WORK have been trying to push hard for GFP_NOFS removal in
> ext4 (at least when we are holding some mutex/semaphore like
> i_data_sem) because otherwise it's possible for the OOM killer to be
> unable to kill processes because they are holding on to locks that
> ext4 is holding.
> 
> I've done some initial investigation, and while it's not that hard to
> remove GFP_NOFS from certain parts of the writepages() codepath (which
> is where we had been running into problems), a really, REALLY big
> problem is that if any_filesystem->writepages() returns ENOMEM, it causes
> silent data loss, because the pages are marked clean, and so data
> written using buffered writeback goes *poof*.
> 
> I confirmed this by creating a test kernel with a simple patch such
> that if the ext4 file system is mounted with -o debug, there was a 1
> in 16 chance that ext4_writepages will immediately return with ENOMEM
> (and printk the inode number, so I knew which inodes had gotten the
> ENOMEM treatment).  The result was **NOT** pretty.
> 
> What I think we should strongly consider is at the very least, special
> case ENOMEM being returned by writepages() during background
> writeback, and *not* mark the pages clean, and make sure the inode
> stays on the dirty inode list, so we can retry the write later.  This
> is especially important since the process that issued the write may
> have gone away, so there might not even be a userspace process to
> complain to.  By converting certain page allocations (most notably in
> ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> release the i_data_sem lock and return an error.  This should allow
> the OOM killer to do its dirty deed, and hopefully we can retry
> the writepages() for that inode later.

Yeah, so if we can hope the error is transient, keeping pages dirty and
retrying the write is definitely a better option. For a start we can say that
ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
there's no hope of getting data to disk and so we just discard them. It
will be a somewhat rough distinction but probably better than what we have
now.
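Jan's rough classification could be written as a simple predicate (a sketch of the initial proposal; the function name is invented, and note that later in the thread ENOSPC's place in the transient bucket is disputed):

```c
#include <errno.h>
#include <stdbool.h>

/* Sketch of the proposed rough split: these errnos mean "retry the
 * writeback later", anything else means the data is discarded. */
static bool writeback_err_is_transient(int err)
{
	switch (err) {
	case -ENOMEM:
	case -EINTR:
	case -EAGAIN:
	case -ENOSPC:	/* contested below: may need to stay hard */
		return true;
	default:
		return false;	/* e.g. -EIO: no hope, drop the pages */
	}
}
```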

								Honza
Jeff Layton March 9, 2017, 10:47 a.m. UTC | #8
On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > On a more general note (DAX is actually fine here), I find the current
> > > practice of clearing page dirty bits on error and reporting it just once
> > > problematic. It keeps the system running but data is lost and possibly
> > > without getting the error anywhere where it is useful. We get away with
> > > this because it is a rare event but it seems like a problematic behavior.
> > > But this is more for the discussion at LSF.
> > 
> > I'm actually running into this in the last day or two because some MM
> > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > ext4 (at least when we are holding some mutex/semaphore like
> > i_data_sem) because otherwise it's possible for the OOM killer to be
> > unable to kill processes because they are holding on to locks that
> > ext4 is holding.
> > 
> > I've done some initial investigation, and while it's not that hard to
> > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > is where we had been running into problems), a really, REALLY big
> > problem is that if any_filesystem->writepages() returns ENOMEM, it causes
> > silent data loss, because the pages are marked clean, and so data
> > written using buffered writeback goes *poof*.
> > 
> > I confirmed this by creating a test kernel with a simple patch such
> > that if the ext4 file system is mounted with -o debug, there was a 1
> > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > (and printk the inode number, so I knew which inodes had gotten the
> > ENOMEM treatment).  The result was **NOT** pretty.
> > 
> > What I think we should strongly consider is at the very least, special
> > case ENOMEM being returned by writepages() during background
> > writeback, and *not* mark the pages clean, and make sure the inode
> > stays on the dirty inode list, so we can retry the write later.  This
> > is especially important since the process that issued the write may
> > have gone away, so there might not even be a userspace process to
> > complain to.  By converting certain page allocations (most notably in
> > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> > release the i_data_sem lock and return an error.  This should allow
> > the OOM killer to do its dirty deed, and hopefully we can retry
> > the writepages() for that inode later.
> 
> Yeah, so if we can hope the error is transient, keeping pages dirty and
> retrying the write is definitely a better option. For a start we can say that
> ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> there's no hope of getting data to disk and so we just discard them. It
> will be a somewhat rough distinction but probably better than what we have
> now.
> 
> 								Honza

I'm not sure about ENOSPC there. That's a return code that is
specifically expected to be returned by fsync. It seems like that ought
not be considered a transient error?
Jan Kara March 9, 2017, 11:02 a.m. UTC | #9
On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> > On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > > On a more general note (DAX is actually fine here), I find the current
> > > > practice of clearing page dirty bits on error and reporting it just once
> > > > problematic. It keeps the system running but data is lost and possibly
> > > > without getting the error anywhere where it is useful. We get away with
> > > > this because it is a rare event but it seems like a problematic behavior.
> > > > But this is more for the discussion at LSF.
> > > 
> > > I'm actually running into this in the last day or two because some MM
> > > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > > ext4 (at least when we are holding some mutex/semaphore like
> > > i_data_sem) because otherwise it's possible for the OOM killer to be
> > > unable to kill processes because they are holding on to locks that
> > > ext4 is holding.
> > > 
> > > I've done some initial investigation, and while it's not that hard to
> > > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > > is where we had been running into problems), a really, REALLY big
> > > problem is that if any_filesystem->writepages() returns ENOMEM, it causes
> > > silent data loss, because the pages are marked clean, and so data
> > > written using buffered writeback goes *poof*.
> > > 
> > > I confirmed this by creating a test kernel with a simple patch such
> > > that if the ext4 file system is mounted with -o debug, there was a 1
> > > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > > (and printk the inode number, so I knew which inodes had gotten the
> > > ENOMEM treatment).  The result was **NOT** pretty.
> > > 
> > > What I think we should strongly consider is at the very least, special
> > > case ENOMEM being returned by writepages() during background
> > > writeback, and *not* mark the pages clean, and make sure the inode
> > > stays on the dirty inode list, so we can retry the write later.  This
> > > is especially important since the process that issued the write may
> > > have gone away, so there might not even be a userspace process to
> > > complain to.  By converting certain page allocations (most notably in
> > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> > > release the i_data_sem lock and return an error.  This should allow
> > > the OOM killer to do its dirty deed, and hopefully we can retry
> > > the writepages() for that inode later.
> > 
> > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > retrying the write is definitely a better option. For a start we can say that
> > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> > there's no hope of getting data to disk and so we just discard them. It
> > will be a somewhat rough distinction but probably better than what we have
> > now.
> > 
> > 								Honza
> 
> I'm not sure about ENOSPC there. That's a return code that is
> specifically expected to be returned by fsync. It seems like that ought
> not be considered a transient error?

Yeah, for a start we should probably keep ENOSPC as is to prevent surprises.
Long term, we may need to make at least some ENOSPC situations behave as
transient to make thin provisioned storage not lose data in case the admin
does not supply additional space fast enough (i.e., before ENOSPC is
actually hit).

EIO is actually in a similar bucket, although probably more on the "hard
failure" side - I can imagine there can be types of storage and situations
where the loss of connectivity to the storage is only transient. But for
a start I would not bother with this.

								Honza
Jeff Layton March 9, 2017, 12:43 p.m. UTC | #10
On Thu, 2017-03-09 at 12:02 +0100, Jan Kara wrote:
> On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> > On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> > > On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > > > On a more general note (DAX is actually fine here), I find the current
> > > > > practice of clearing page dirty bits on error and reporting it just once
> > > > > problematic. It keeps the system running but data is lost and possibly
> > > > > without getting the error anywhere where it is useful. We get away with
> > > > > this because it is a rare event but it seems like a problematic behavior.
> > > > > But this is more for the discussion at LSF.
> > > > 
> > > > I'm actually running into this in the last day or two because some MM
> > > > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > > > ext4 (at least when we are holding some mutex/semaphore like
> > > > i_data_sem) because otherwise it's possible for the OOM killer to be
> > > > unable to kill processes because they are holding on to locks that
> > > > ext4 is holding.
> > > > 
> > > > I've done some initial investigation, and while it's not that hard to
> > > > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > > > is where we had been running into problems), a really, REALLY big
> > > > problem is that if any_filesystem->writepages() returns ENOMEM, it causes
> > > > silent data loss, because the pages are marked clean, and so data
> > > > written using buffered writeback goes *poof*.
> > > > 
> > > > I confirmed this by creating a test kernel with a simple patch such
> > > > that if the ext4 file system is mounted with -o debug, there was a 1
> > > > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > > > (and printk the inode number, so I knew which inodes had gotten the
> > > > ENOMEM treatment).  The result was **NOT** pretty.
> > > > 
> > > > What I think we should strongly consider is at the very least, special
> > > > case ENOMEM being returned by writepages() during background
> > > > writeback, and *not* mark the pages clean, and make sure the inode
> > > > stays on the dirty inode list, so we can retry the write later.  This
> > > > is especially important since the process that issued the write may
> > > > have gone away, so there might not even be a userspace process to
> > > > complain to.  By converting certain page allocations (most notably in
> > > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> > > > release the i_data_sem lock and return an error.  This should allow
> > > > the OOM killer to do its dirty deed, and hopefully we can retry
> > > > the writepages() for that inode later.
> > > 
> > > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > > retrying the write is definitely a better option. For a start we can say that
> > > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> > > there's no hope of getting data to disk and so we just discard them. It
> > > will be a somewhat rough distinction but probably better than what we have
> > > now.
> > > 
> > > 								Honza
> > 
> > I'm not sure about ENOSPC there. That's a return code that is
> > specifically expected to be returned by fsync. It seems like that ought
> > not be considered a transient error?
> 
> Yeah, for a start we should probably keep ENOSPC as is to prevent surprises.
> Long term, we may need to make at least some ENOSPC situations behave as
> transient to make thin provisioned storage not lose data in case the admin
> does not supply additional space fast enough (i.e., before ENOSPC is
> actually hit).
> 

Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
transient error? Just have it hang until we get enough space when that
tunable is enabled?
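The tunable idea can be sketched as a per-fs flag consulted when classifying a writeback error (purely hypothetical: the names `wb_classify`, `enospc_is_transient`, and the `wb_disposition` enum are invented; nothing like this exists in the kernel as of this thread):

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical sketch of the proposed tunable. */
enum wb_disposition { WB_FAIL_HARD, WB_RETRY_LATER };

struct fs_model {
	bool enospc_is_transient;	/* the proposed per-fs knob */
};

static enum wb_disposition wb_classify(const struct fs_model *fs, int err)
{
	switch (err) {
	case -ENOMEM:
	case -EINTR:
	case -EAGAIN:
		return WB_RETRY_LATER;
	case -ENOSPC:
		/* admin decides: hang on to the pages until space shows
		 * up (thin provisioning), or fail fsync() with ENOSPC */
		return fs->enospc_is_transient ? WB_RETRY_LATER
					       : WB_FAIL_HARD;
	default:
		return WB_FAIL_HARD;	/* e.g. -EIO */
	}
}
```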

> EIO is actually in a similar bucket, although probably more on the "hard
> failure" side - I can imagine there can be types of storage and situations
> where the loss of connectivity to the storage is only transient. But for
> a start I would not bother with this.
> 
> 								Honza

I don't see what we can reasonably do with -EIO other than return a hard
error. If we want to deal with loss of connectivity to storage as a
transient failure, I think that we'd need to ensure that the lower
layers return more distinct error codes in those cases (ENODEV or ENXIO
maybe? Or declare a new kernel-internal code -- EDEVGONE?).

In any case, I think that the basic idea of marking certain
writepage/writepages/launder_page errors as transient might be a
reasonable approach to handling this sanely.

The problem with all of this though is that we have a pile of existing
code that will likely need to be reworked for the new error handling. I
expect that we'll have to walk all of the
writepage/writepages/launder_page implementations and fix them up one by
one once we sort out the rules for this.
Brian Foster March 9, 2017, 1:22 p.m. UTC | #11
On Thu, Mar 09, 2017 at 07:43:12AM -0500, Jeff Layton wrote:
> On Thu, 2017-03-09 at 12:02 +0100, Jan Kara wrote:
> > On Thu 09-03-17 05:47:51, Jeff Layton wrote:
> > > On Thu, 2017-03-09 at 10:04 +0100, Jan Kara wrote:
> > > > On Wed 08-03-17 21:57:25, Ted Tso wrote:
> > > > > On Tue, Mar 07, 2017 at 11:26:22AM +0100, Jan Kara wrote:
> > > > > > On a more general note (DAX is actually fine here), I find the current
> > > > > > practice of clearing page dirty bits on error and reporting it just once
> > > > > > problematic. It keeps the system running but data is lost and possibly
> > > > > > without getting the error anywhere where it is useful. We get away with
> > > > > > this because it is a rare event but it seems like a problematic behavior.
> > > > > > But this is more for the discussion at LSF.
> > > > > 
> > > > > I'm actually running into this in the last day or two because some MM
> > > > > folks at $WORK have been trying to push hard for GFP_NOFS removal in
> > > > > ext4 (at least when we are holding some mutex/semaphore like
> > > > > i_data_sem) because otherwise it's possible for the OOM killer to be
> > > > > unable to kill processes because they are holding on to locks that
> > > > > ext4 is holding.
> > > > > 
> > > > > I've done some initial investigation, and while it's not that hard to
> > > > > remove GFP_NOFS from certain parts of the writepages() codepath (which
> > > > > is where we had been running into problems), a really, REALLY big
> > > > > problem is that if any_filesystem->writepages() returns ENOMEM, it causes
> > > > > silent data loss, because the pages are marked clean, and so data
> > > > > written using buffered writeback goes *poof*.
> > > > > 
> > > > > I confirmed this by creating a test kernel with a simple patch such
> > > > > that if the ext4 file system is mounted with -o debug, there was a 1
> > > > > in 16 chance that ext4_writepages will immediately return with ENOMEM
> > > > > (and printk the inode number, so I knew which inodes had gotten the
> > > > > ENOMEM treatment).  The result was **NOT** pretty.
> > > > > 
> > > > > What I think we should strongly consider is at the very least, special
> > > > > case ENOMEM being returned by writepages() during background
> > > > > writeback, and *not* mark the pages clean, and make sure the inode
> > > > > stays on the dirty inode list, so we can retry the write later.  This
> > > > > is especially important since the process that issued the write may
> > > > > have gone away, so there might not even be a userspace process to
> > > > > complain to.  By converting certain page allocations (most notably in
> > > > > ext4_mb_load_buddy) from GFP_NOFS to GFP_KMALLOC, this allows us to
> > > > > release the i_data_sem lock and return an error.  This should allow
> > > > > the OOM killer to do its dirty deed, and hopefully we can retry
> > > > > the writepages() for that inode later.
> > > > 
> > > > Yeah, so if we can hope the error is transient, keeping pages dirty and
> > > > retrying the write is definitely a better option. For a start we can say that
> > > > ENOMEM, EINTR, EAGAIN, ENOSPC errors are transient, anything else means
> > > > there's no hope of getting data to disk and so we just discard them. It
> > > > will be a somewhat rough distinction but probably better than what we have
> > > > now.
> > > > 
> > > > 								Honza
> > > 
> > > I'm not sure about ENOSPC there. That's a return code that is
> > > specifically expected to be returned by fsync. It seems like that ought
> > > not be considered a transient error?
> > 
> > Yeah, for a start we should probably keep ENOSPC as is to prevent surprises.
> > Long term, we may need to make at least some ENOSPC situations behave as
> > transient to make thin provisioned storage not lose data in case the admin
> > does not supply additional space fast enough (i.e., before ENOSPC is
> > actually hit).
> > 
> 
> Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
> transient error? Just have it hang until we get enough space when that
> tunable is enabled?
> 

Just FYI, XFS has a similar error configuration mechanism that we use
for essentially this purpose when dealing with metadata buffer I/O
errors.  The original motivation was to help us deal with the varying
requirements for thin provisioning. I.e., whether a particular error is
permanent or transient depends on the admin's preference and affects
whether the filesystem continuously retries failed I/Os in anticipation
of future success or gives up quickly and shuts down (or something in
between).

See roughly commits 192852be8b5 through e6b3bb7896 for the initial code,
/sys/fs/xfs/<dev>/error on an XFS filesystem for the user interface, and
the "Error handling" section of Documentation/filesystems/xfs.txt for
information on how the interface works.
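The retry-policy shape Brian describes can be modeled roughly like this (semantics simplified and names invented; see the real sysfs interface documented in xfs.txt for the actual knobs, where per-error-class `max_retries` with -1 meaning "retry forever" is the configurable part):

```c
#include <stdbool.h>

/* Rough model of a per-error-class retry policy: -1 retries forever
 * (treat the error as transient), otherwise give up once the retry
 * budget is exhausted (treat it as permanent). */
struct err_cfg {
	int max_retries;
};

static bool should_retry(const struct err_cfg *cfg, int retries_so_far)
{
	if (cfg->max_retries < 0)
		return true;	/* admin chose: never give up */
	return retries_so_far < cfg->max_retries;
}
```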

Brian

> > EIO is actually in a similar bucket, although probably more on the "hard
> > failure" side - I can imagine there can be types of storage and situations
> > where the loss of connectivity to the storage is only transient. But for
> > a start I would not bother with this.
> > 
> > 								Honza
> 
> I don't see what we can reasonably do with -EIO other than return a hard
> error. If we want to deal with loss of connectivity to storage as a
> transient failure, I think that we'd need to ensure that the lower
> layers return more distinct error codes in those cases (ENODEV or ENXIO
> maybe? Or declare a new kernel-internal code -- EDEVGONE?).
> 
> In any case, I think that the basic idea of marking certain
> writepage/writepages/launder_page errors as transient might be a
> reasonable approach to handling this sanely.
> 
> The problem with all of this though is that we have a pile of existing
> code that will likely need to be reworked for the new error handling. I
> expect that we'll have to walk all of the
> writepage/writepages/launder_page implementations and fix them up one by
> one once we sort out the rules for this.
> 
> -- 
> Jeff Layton <jlayton@redhat.com>
Theodore Y. Ts'o March 9, 2017, 2:21 p.m. UTC | #12
On Thu, Mar 09, 2017 at 07:43:12AM -0500, Jeff Layton wrote:
> 
> Maybe we need a systemwide (or fs-level) tunable that makes ENOSPC a
> transient error? Just have it hang until we get enough space when that
> tunable is enabled?

Or maybe we need a new kernel-internal errno (a la ERESTARTSYS) which
means it's a "soft ENOSPC"?  It would get translated to ENOSPC if it
gets propagated to userspace, but that way for devices like dm-thin or
other storage array with thin volumes, it could send back a soft
ENOSPC, while for file systems where "ENOSPC means ENOSPC", we can
treat those as a hard ENOSPC.
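The soft-ENOSPC idea amounts to an internal-only errno translated at the syscall boundary, the way ERESTARTSYS is rewritten before returning to userspace (a hypothetical sketch: `ENOSPC_SOFT` and its value are invented; no such code exists):

```c
#include <errno.h>

/* Hypothetical kernel-internal errno: a "soft" ENOSPC that writeback
 * may treat as transient and retry. The value is arbitrary, chosen
 * above the public errno range so it can never leak ambiguously. */
#define ENOSPC_SOFT	519

/* Translate internal codes at the userspace boundary: callers in the
 * kernel can distinguish soft from hard, userspace only sees ENOSPC. */
static int errno_to_user(int err)
{
	if (err == -ENOSPC_SOFT)
		return -ENOSPC;
	return err;
}
```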

						- Ted

Patch
diff mbox

diff --git a/fs/dax.c b/fs/dax.c
index c45598b912e1..9005d90deeda 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -888,8 +888,10 @@  int dax_writeback_mapping_range(struct address_space *mapping,
 
 			ret = dax_writeback_one(bdev, mapping, indices[i],
 					pvec.pages[i]);
-			if (ret < 0)
+			if (ret < 0) {
+				mapping_set_error(mapping, ret);
 				return ret;
+			}
 		}
 	}
 	return 0;