mbox series

[v4,00/17] xfs: flush related error handling cleanups

Message ID 20200504141154.55887-1-bfoster@redhat.com (mailing list archive)
Headers show
Series xfs: flush related error handling cleanups | expand

Message

Brian Foster May 4, 2020, 2:11 p.m. UTC
Hi all,

I think everything has been reviewed to this point. Only minor changes
noted below in this release. A git repo is available here[1].

The only outstanding feedback that I'm aware of is Dave's comment on
patch 7 of v3 [2] regarding the shutdown assert check. I'm not aware of
any means to get through xfs_wait_buftarg() with a dirty buffer that
hasn't undergone the permanant error sequence and thus shut down the fs.
I'm personally fine with any option among dropping the assert, keeping
it or replacing it with a shutdown invocation because atm I don't see it
making much of a functional difference, but I left it alone because the
patch is otherwise reviewed.

Thoughts, reviews, flames appreciated.

Brian

[1] https://github.com/bsfost/linux-xfs/tree/xfs-flush-error-handling-cleanups-v4
[2] https://lore.kernel.org/linux-xfs/20200501112408.GB40250@bfoster/

v4:
- Remove unnecessary xfs_trans_ail_delete() comment.
- Add Fixes: tag to dqflush duplicate verification patch.
v3: https://lore.kernel.org/linux-xfs/20200429172153.41680-1-bfoster@redhat.com/
- Drop flags param from xfs_buf_ioend_fail().
- Fix up iflush error handling patch subject and comments.
- Drop failed buffer ->ail_lock bypass patch.
- Split up AIL removal cleanup patch and dropped switch of call from
  xfs_buf_item_put().
- Rework XBF_WRITE_FAIL to reflect current buffer state.
- Create helper for ratelimited buffer alerts and use appropriately.
- Use BLK_STS_IOERR instead of errno_to_blk_status().
- Drop unused param from xfs_imap_to_bp().
v2: https://lore.kernel.org/linux-xfs/20200422175429.38957-1-bfoster@redhat.com/
- Rename some helper functions.
- Fix up dquot flush verifier instead of removing it.
- Drop quotaoff push handler removal patch.
- Reuse existing ratelimit state for buffer error messages.
- Combine AIL removal helpers.
- Refactor iflush error handling rework to update log item.
- Remove unused shutdown types.
v1: https://lore.kernel.org/linux-xfs/20200417150859.14734-1-bfoster@redhat.com/

Brian Foster (17):
  xfs: refactor failed buffer resubmission into xfsaild
  xfs: factor out buffer I/O failure code
  xfs: simplify inode flush error handling
  xfs: remove unnecessary shutdown check from xfs_iflush()
  xfs: reset buffer write failure state on successful completion
  xfs: refactor ratelimited buffer error messages into helper
  xfs: ratelimit unmount time per-buffer I/O error alert
  xfs: fix duplicate verification from xfs_qm_dqflush()
  xfs: abort consistently on dquot flush failure
  xfs: acquire ->ail_lock from xfs_trans_ail_delete()
  xfs: use delete helper for items expected to be in AIL
  xfs: drop unused shutdown parameter from xfs_trans_ail_remove()
  xfs: combine xfs_trans_ail_[remove|delete]()
  xfs: remove unused iflush stale parameter
  xfs: random buffer write failure errortag
  xfs: remove unused shutdown types
  xfs: remove unused iget_flags param from xfs_imap_to_bp()

 fs/xfs/libxfs/xfs_errortag.h  |   4 +-
 fs/xfs/libxfs/xfs_inode_buf.c |  12 +--
 fs/xfs/libxfs/xfs_inode_buf.h |   2 +-
 fs/xfs/scrub/ialloc.c         |   3 +-
 fs/xfs/xfs_bmap_item.c        |   2 +-
 fs/xfs/xfs_buf.c              |  65 ++++++++++++----
 fs/xfs/xfs_buf.h              |   2 +
 fs/xfs/xfs_buf_item.c         | 106 +++++---------------------
 fs/xfs/xfs_buf_item.h         |   2 -
 fs/xfs/xfs_dquot.c            |  47 +++++-------
 fs/xfs/xfs_dquot_item.c       |  17 +----
 fs/xfs/xfs_error.c            |   3 +
 fs/xfs/xfs_extfree_item.c     |   2 +-
 fs/xfs/xfs_fsops.c            |   5 +-
 fs/xfs/xfs_icache.c           |   2 +-
 fs/xfs/xfs_inode.c            | 139 ++++++++++++----------------------
 fs/xfs/xfs_inode_item.c       |  28 +------
 fs/xfs/xfs_inode_item.h       |   2 +-
 fs/xfs/xfs_log_recover.c      |   2 +-
 fs/xfs/xfs_message.c          |  22 ++++++
 fs/xfs/xfs_message.h          |   3 +
 fs/xfs/xfs_mount.h            |   2 -
 fs/xfs/xfs_refcount_item.c    |   2 +-
 fs/xfs/xfs_rmap_item.c        |   2 +-
 fs/xfs/xfs_trans_ail.c        |  68 +++++++++++------
 fs/xfs/xfs_trans_priv.h       |  18 +----
 26 files changed, 231 insertions(+), 331 deletions(-)

Comments

Dave Chinner May 4, 2020, 9:53 p.m. UTC | #1
On Mon, May 04, 2020 at 10:11:37AM -0400, Brian Foster wrote:
> Hi all,
> 
> I think everything has been reviewed to this point. Only minor changes
> noted below in this release. A git repo is available here[1].
> 
> The only outstanding feedback that I'm aware of is Dave's comment on
> patch 7 of v3 [2] regarding the shutdown assert check. I'm not aware of
> any means to get through xfs_wait_buftarg() with a dirty buffer that
> hasn't undergone the permanant error sequence and thus shut down the fs.

# echo 0 > /sys/fs/xfs/<dev>/fail_at_unmount

And now any error with a "retry forever" config (the default) will
be collected by xfs_buftarg_wait() without a preceeding shutdown as
xfs_buf_iodone_callback_error() will not treat it as a permanent
error during unmount. i.e. this doesn't trigger:

        /* At unmount we may treat errors differently */
        if ((mp->m_flags & XFS_MOUNT_UNMOUNTING) && mp->m_fail_unmount)
                goto permanent_error;

and so the error handling just marks it with a write error and lets
it go for a write retry in future. These are then collected in
xfs_buftarg_wait() as nothing is going to retry them once unmount
gets to this point...

Cheers,

Dave.
Brian Foster May 5, 2020, 11:58 a.m. UTC | #2
On Tue, May 05, 2020 at 07:53:07AM +1000, Dave Chinner wrote:
> On Mon, May 04, 2020 at 10:11:37AM -0400, Brian Foster wrote:
> > Hi all,
> > 
> > I think everything has been reviewed to this point. Only minor changes
> > noted below in this release. A git repo is available here[1].
> > 
> > The only outstanding feedback that I'm aware of is Dave's comment on
> > patch 7 of v3 [2] regarding the shutdown assert check. I'm not aware of
> > any means to get through xfs_wait_buftarg() with a dirty buffer that
> > hasn't undergone the permanant error sequence and thus shut down the fs.
> 
> # echo 0 > /sys/fs/xfs/<dev>/fail_at_unmount
> 
> And now any error with a "retry forever" config (the default) will
> be collected by xfs_buftarg_wait() without a preceeding shutdown as
> xfs_buf_iodone_callback_error() will not treat it as a permanent
> error during unmount. i.e. this doesn't trigger:
> 
>         /* At unmount we may treat errors differently */
>         if ((mp->m_flags & XFS_MOUNT_UNMOUNTING) && mp->m_fail_unmount)
>                 goto permanent_error;
> 
> and so the error handling just marks it with a write error and lets
> it go for a write retry in future. These are then collected in
> xfs_buftarg_wait() as nothing is going to retry them once unmount
> gets to this point...
> 

That doesn't accurately describe the behavior of that configuration,
though. "Retry forever" means that dirty buffers are going to cycle
through submission retries and the unmount is going to hang indefinitely
(on pushing the AIL). Indeed, preventing this unmount hang is the
original purpose of the fail at unmount knob (commit here[1]).

IOW, we don't get to xfs_wait_buftarg() in that scenario until all dirty
buffers are either written back successfully or the error configuration
changes to process the failures as permanent errors and shuts down the
fs. This can be confirmed easily with the buffer I/O error injection
patch (use a value of 1 to simulate persistent errors).

Brian

[1] e6b3bb78962e6 ("xfs: add "fail at unmount" error handling configuration")

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>
Dave Chinner May 5, 2020, 10:36 p.m. UTC | #3
On Tue, May 05, 2020 at 07:58:10AM -0400, Brian Foster wrote:
> On Tue, May 05, 2020 at 07:53:07AM +1000, Dave Chinner wrote:
> > On Mon, May 04, 2020 at 10:11:37AM -0400, Brian Foster wrote:
> > > Hi all,
> > > 
> > > I think everything has been reviewed to this point. Only minor changes
> > > noted below in this release. A git repo is available here[1].
> > > 
> > > The only outstanding feedback that I'm aware of is Dave's comment on
> > > patch 7 of v3 [2] regarding the shutdown assert check. I'm not aware of
> > > any means to get through xfs_wait_buftarg() with a dirty buffer that
> > > hasn't undergone the permanant error sequence and thus shut down the fs.
> > 
> > # echo 0 > /sys/fs/xfs/<dev>/fail_at_unmount
> > 
> > And now any error with a "retry forever" config (the default) will
> > be collected by xfs_buftarg_wait() without a preceeding shutdown as
> > xfs_buf_iodone_callback_error() will not treat it as a permanent
> > error during unmount. i.e. this doesn't trigger:
> > 
> >         /* At unmount we may treat errors differently */
> >         if ((mp->m_flags & XFS_MOUNT_UNMOUNTING) && mp->m_fail_unmount)
> >                 goto permanent_error;
> > 
> > and so the error handling just marks it with a write error and lets
> > it go for a write retry in future. These are then collected in
> > xfs_buftarg_wait() as nothing is going to retry them once unmount
> > gets to this point...
> > 
> 
> That doesn't accurately describe the behavior of that configuration,
> though. "Retry forever" means that dirty buffers are going to cycle
> through submission retries and the unmount is going to hang indefinitely
> (on pushing the AIL). Indeed, preventing this unmount hang is the
> original purpose of the fail at unmount knob (commit here[1]).

So this patch is relying on something else hanging before we get to
xfs_wait_buftarg and hence "we don't have to handle that here".

Please document that assumption along with the ASSERT, because if we
change AIL behaviour to prevent retry-forever hangs with freeze
and/or remount-ro we're going to hit this assert...

Cheers,

Dave.
Brian Foster May 6, 2020, 11:04 a.m. UTC | #4
On Wed, May 06, 2020 at 08:36:43AM +1000, Dave Chinner wrote:
> On Tue, May 05, 2020 at 07:58:10AM -0400, Brian Foster wrote:
> > On Tue, May 05, 2020 at 07:53:07AM +1000, Dave Chinner wrote:
> > > On Mon, May 04, 2020 at 10:11:37AM -0400, Brian Foster wrote:
> > > > Hi all,
> > > > 
> > > > I think everything has been reviewed to this point. Only minor changes
> > > > noted below in this release. A git repo is available here[1].
> > > > 
> > > > The only outstanding feedback that I'm aware of is Dave's comment on
> > > > patch 7 of v3 [2] regarding the shutdown assert check. I'm not aware of
> > > > any means to get through xfs_wait_buftarg() with a dirty buffer that
> > > > hasn't undergone the permanant error sequence and thus shut down the fs.
> > > 
> > > # echo 0 > /sys/fs/xfs/<dev>/fail_at_unmount
> > > 
> > > And now any error with a "retry forever" config (the default) will
> > > be collected by xfs_buftarg_wait() without a preceeding shutdown as
> > > xfs_buf_iodone_callback_error() will not treat it as a permanent
> > > error during unmount. i.e. this doesn't trigger:
> > > 
> > >         /* At unmount we may treat errors differently */
> > >         if ((mp->m_flags & XFS_MOUNT_UNMOUNTING) && mp->m_fail_unmount)
> > >                 goto permanent_error;
> > > 
> > > and so the error handling just marks it with a write error and lets
> > > it go for a write retry in future. These are then collected in
> > > xfs_buftarg_wait() as nothing is going to retry them once unmount
> > > gets to this point...
> > > 
> > 
> > That doesn't accurately describe the behavior of that configuration,
> > though. "Retry forever" means that dirty buffers are going to cycle
> > through submission retries and the unmount is going to hang indefinitely
> > (on pushing the AIL). Indeed, preventing this unmount hang is the
> > original purpose of the fail at unmount knob (commit here[1]).
> 
> So this patch is relying on something else hanging before we get to
> xfs_wait_buftarg and hence "we don't have to handle that here".
> 

Yes, though it's the existing code that relies on the external push to
block until everything is written back or a permanent error is issued.
This patch just splits up and rate limits a warning message. I added the
assert because of the previous discussion around making sure this
situation is handled properly and because that error processing
dependency is not obvious.

> Please document that assumption along with the ASSERT, because if we
> change AIL behaviour to prevent retry-forever hangs with freeze
> and/or remount-ro we're going to hit this assert...
> 

Sure, I'll add a comment..

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>