
xfs: clear XBF_ASYNC if we attach an iodone callback

Message ID 1488552404-21379-1-git-send-email-jbacik@fb.com (mailing list archive)
State New, archived

Commit Message

Josef Bacik March 3, 2017, 2:46 p.m. UTC
While testing nbd disconnects I kept hitting the following hang

Call Trace:
 schedule+0x35/0x80
 xfs_ail_push_all_sync+0xa3/0xe0
 ? prepare_to_wait_event+0x100/0x100
 xfs_unmountfs+0x57/0x190
 xfs_fs_put_super+0x32/0x90
 generic_shutdown_super+0x6f/0xf0
 kill_block_super+0x27/0x70
 deactivate_locked_super+0x3e/0x70
 deactivate_super+0x46/0x60
 cleanup_mnt+0x3f/0x80
 __cleanup_mnt+0x12/0x20
 task_work_run+0x86/0xb0
 exit_to_usermode_loop+0x6d/0x96
 do_syscall_64+0x8b/0xa0
 entry_SYSCALL64_slow_path+0x25/0x25

After some digging around I found that there was a log item on the AIL
with a callback of xfs_iflush_done.  A printk confirmed that the buffer
already had XBF_ASYNC set at the call to xfs_buf_attach_iodone in
xfs_iflush_int, which means that on error we do not call
xfs_buf_do_callbacks.  That leaves the log item on the AIL, which causes
us to hang on unmount.  I assume the buffer has XBF_ASYNC set because we
did a readahead on it, so it doesn't really need to keep XBF_ASYNC at
this point: we do actually care about what happens to the buffer once
the IO is complete.  With this patch my hang no longer happens.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/xfs/xfs_buf_item.c | 5 +++++
 1 file changed, 5 insertions(+)
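
For context, the unmount path in the trace above blocks in
xfs_ail_push_all_sync(), which waits until the AIL is completely empty.
A rough sketch of that function from memory of kernels in this era
(field names approximate, not a verbatim copy):

	void
	xfs_ail_push_all_sync(
		struct xfs_ail		*ailp)
	{
		struct xfs_log_item	*lip;
		DEFINE_WAIT(wait);

		spin_lock(&ailp->xa_lock);
		/* keep kicking xfsaild until nothing is left on the AIL */
		while ((lip = xfs_ail_max(ailp)) != NULL) {
			prepare_to_wait(&ailp->xa_empty, &wait, TASK_UNINTERRUPTIBLE);
			ailp->xa_target = lip->li_lsn;
			wake_up_process(ailp->xa_task);
			spin_unlock(&ailp->xa_lock);
			schedule();
			spin_lock(&ailp->xa_lock);
		}
		spin_unlock(&ailp->xa_lock);
		finish_wait(&ailp->xa_empty, &wait);
	}

So any log item whose completion callback never runs (and is never
aborted) keeps the AIL non-empty and pins unmount forever, which is the
hang reported above.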

Comments

Brian Foster March 3, 2017, 8:49 p.m. UTC | #1
On Fri, Mar 03, 2017 at 09:46:44AM -0500, Josef Bacik wrote:
> While testing nbd disconnects I kept hitting the following hang
> 
> Call Trace:
>  schedule+0x35/0x80
>  xfs_ail_push_all_sync+0xa3/0xe0
>  ? prepare_to_wait_event+0x100/0x100
>  xfs_unmountfs+0x57/0x190
>  xfs_fs_put_super+0x32/0x90
>  generic_shutdown_super+0x6f/0xf0
>  kill_block_super+0x27/0x70
>  deactivate_locked_super+0x3e/0x70
>  deactivate_super+0x46/0x60
>  cleanup_mnt+0x3f/0x80
>  __cleanup_mnt+0x12/0x20
>  task_work_run+0x86/0xb0
>  exit_to_usermode_loop+0x6d/0x96
>  do_syscall_64+0x8b/0xa0
>  entry_SYSCALL64_slow_path+0x25/0x25
> 
> After some digging around I found that there was a log item on the ail
> with a callback of xfs_iflush_done.  A printk confirmed that at the call
> to xfs_buf_attach_iodone in xfs_iflush_int had XBF_ASYNC already set,
> which means on error we do not call xfs_buf_do_callbacks, which leaves
> the log item on the ail list which causes us to hang on unmount.  I
> assume the block has XBF_ASYNC set because we did a readahead on it, so
> it doesn't really need to have XBF_ASYNC set at this point as we do
> actually care about what happens to the buffer once IO is complete.
> With this patch my hang no longer happens.
> 
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> ---
>  fs/xfs/xfs_buf_item.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 2975cb2..24fcb67 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1016,6 +1016,11 @@ xfs_buf_attach_iodone(
>  
>  	ASSERT(bp->b_iodone == NULL ||
>  	       bp->b_iodone == xfs_buf_iodone_callbacks);
> +	/*
> +	 * Somebody now cares about the fate of this buffer, clear XBF_ASYNC so
> +	 * that the iodone callback actually gets called.
> +	 */
> +	bp->b_flags &= ~XBF_ASYNC;

XBF_ASYNC generally describes the type of I/O being submitted for a
particular buffer (blocking or non), so it being set or not is kind of
outside the scope of xfs_iflush(). It really depends on how the caller
ultimately wants to write the buffer. For example, if we get here via
xfsaild_push()->xfs_inode_item_push(), the latter will do
xfs_buf_delwri_queue() and the former will eventually call
xfs_buf_delwri_submit_nowait(), which submits async I/O and so will
re-add XBF_ASYNC.
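
For reference, the delwri submission path marks every queued buffer for
async write before it goes back down; roughly, simplified from the
xfs_buf_delwri_submit_buffers() of this era (not a verbatim copy):

	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		...
		bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_WRITE_FAIL);
		bp->b_flags |= XBF_WRITE | XBF_ASYNC;	/* async re-added here */
		...
		xfs_buf_submit(bp);
	}

so clearing XBF_ASYNC in xfs_buf_attach_iodone() doesn't stick for the
AIL writeback path anyway.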

It is interesting if this prevents your problem. I can't quite make out
why that would be. The other xfs_iflush() caller is reclaim, which calls
xfs_bwrite() which itself clears XBF_ASYNC before submission.

What I'm guessing you're referring to above wrt not calling
xfs_buf_do_callbacks() is the async check in
xfs_buf_iodone_callback_error(), which is only relevant on I/O error. If
the buffer I/O succeeds, we should always invoke the iodone callbacks
regardless of async state. Do you observe otherwise or does this I/O
indeed fail?
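
That check only gates the error path; on successful completion the
top-level iodone handler runs the callbacks unconditionally. A
simplified sketch of the upstream xfs_buf_iodone_callbacks() of this
era (not verbatim):

	void
	xfs_buf_iodone_callbacks(
		struct xfs_buf		*bp)
	{
		/* only the error path cares about sync vs. async */
		if (bp->b_error && xfs_buf_iodone_callback_error(bp))
			return;

		/* success or permanent error: clear retry state ... */
		bp->b_last_error = 0;
		bp->b_retries = 0;
		bp->b_first_retry_time = 0;

		/* ... and always run the attached callbacks */
		xfs_buf_do_callbacks(bp);
		bp->b_fspriv = NULL;
		bp->b_iodone = NULL;
		xfs_buf_ioend(bp);
	}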

Anyways, if we change the buffer flags simply to control the behavior of
iodone_callback_error(), we've basically decided to treat an otherwise
async I/O as sync with the assumption that the submitter is going to
handle the I/O error. As mentioned above, we don't really have that
information here so that is kind of a dangerous assumption. At the very
least, it bypasses all of the configurable error handling infrastructure
that is specifically in place for async submitted buffers.
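
(The configurable error handling referred to here is the sysfs knobs
added around v4.8, max_retries and retry_timeout_seconds under
/sys/fs/xfs/<dev>/error/metadata/. Roughly, xfs_buf_iodone_callback_error()
consults them like this; a simplified sketch, not verbatim:

	cfg = xfs_error_get_cfg(mp, XFS_ERR_METADATA, bp->b_error);
	...
	/* permanent error once the configured retries are exhausted */
	if (cfg->max_retries != XFS_ERR_RETRY_FOREVER &&
	    ++bp->b_retries > cfg->max_retries)
		goto permanent_error;
	if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER &&
	    time_after(jiffies, cfg->retry_timeout + bp->b_first_retry_time))
		goto permanent_error;

Treating the I/O as sync skips all of this.)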

I'm wondering if you're hitting the flush locked inode retry problem
that Carlos has been working on. The problem is basically that if the
I/O fails on a buffer that backs a flush locked inode, we never retry
the buffer and xfsaild basically spins on pushing the inode. What
happens if you enable xfs tracepoints while the fs is hung?
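
(The flush lock issue in a nutshell: once an inode has been flushed to
its backing buffer it stays flush-locked until the buffer I/O completes.
If that I/O fails and the completion never releases the flush lock,
xfs_inode_item_push() can't do anything with the item on later AIL
pushes; a simplified sketch of the relevant check, not verbatim:

	/*
	 * Someone else is already flushing the inode; nothing to do here
	 * but wait for the flush to finish and remove the item from the AIL.
	 */
	if (!xfs_iflock_nowait(ip)) {
		rval = XFS_ITEM_FLUSHING;
		goto out_unlock;
	}

so xfsaild just spins, reporting XFS_ITEM_FLUSHING forever.)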

Brian

>  	bp->b_iodone = xfs_buf_iodone_callbacks;
>  }
>  
> -- 
> 2.5.5
> 
Josef Bacik March 3, 2017, 9:34 p.m. UTC | #2
On Fri, 2017-03-03 at 15:49 -0500, Brian Foster wrote:
> On Fri, Mar 03, 2017 at 09:46:44AM -0500, Josef Bacik wrote:
> > 
> > While testing nbd disconnects I kept hitting the following hang
> > 
> > Call Trace:
> >  schedule+0x35/0x80
> >  xfs_ail_push_all_sync+0xa3/0xe0
> >  ? prepare_to_wait_event+0x100/0x100
> >  xfs_unmountfs+0x57/0x190
> >  xfs_fs_put_super+0x32/0x90
> >  generic_shutdown_super+0x6f/0xf0
> >  kill_block_super+0x27/0x70
> >  deactivate_locked_super+0x3e/0x70
> >  deactivate_super+0x46/0x60
> >  cleanup_mnt+0x3f/0x80
> >  __cleanup_mnt+0x12/0x20
> >  task_work_run+0x86/0xb0
> >  exit_to_usermode_loop+0x6d/0x96
> >  do_syscall_64+0x8b/0xa0
> >  entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > After some digging around I found that there was a log item on the
> > ail
> > with a callback of xfs_iflush_done.  A printk confirmed that at the
> > call
> > to xfs_buf_attach_iodone in xfs_iflush_int had XBF_ASYNC already
> > set,
> > which means on error we do not call xfs_buf_do_callbacks, which
> > leaves
> > the log item on the ail list which causes us to hang on unmount.  I
> > assume the block has XBF_ASYNC set because we did a readahead on
> > it, so
> > it doesn't really need to have XBF_ASYNC set at this point as we do
> > actually care about what happens to the buffer once IO is complete.
> > With this patch my hang no longer happens.
> > 
> > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > ---
> >  fs/xfs/xfs_buf_item.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > index 2975cb2..24fcb67 100644
> > --- a/fs/xfs/xfs_buf_item.c
> > +++ b/fs/xfs/xfs_buf_item.c
> > @@ -1016,6 +1016,11 @@ xfs_buf_attach_iodone(
> >  
> >  	ASSERT(bp->b_iodone == NULL ||
> >  	       bp->b_iodone == xfs_buf_iodone_callbacks);
> > +	/*
> > +	 * Somebody now cares about the fate of this buffer, clear
> > XBF_ASYNC so
> > +	 * that the iodone callback actually gets called.
> > +	 */
> > +	bp->b_flags &= ~XBF_ASYNC;
> XBF_ASYNC generally describes the type of I/O being submitted for a
> particular buffer (blocking or non), so it being set or not is kind
> of
> outside the scope of xfs_iflush(). It really depends on how the
> caller
> ultimately wants to write the buffer. For example, if we get here via
> xfsaild_push()->xfs_inode_item_push(), the latter will do
> xfs_buf_delwri_queue() and the former will eventually call
> xfs_buf_delwri_submit_nowait(), which submits async I/O and so will
> re-add XBF_ASYNC.
> 
> It is interesting if this prevents your problem. I can't quite make
> out
> why that would be. The other xfs_iflush() caller is reclaim, which
> calls
> xfs_bwrite() which itself clears XBF_ASYNC before submission.
> 
> What I'm guessing you're referring to above wrt to not calling
> xfs_buf_do_callbacks() is the async check in
> xfs_buf_iodone_callback_error(), which is only relevant on I/O error.
> If
> the buffer I/O succeeds, we should always invoke the iodone callbacks
> regardless of async state. Do you observe otherwise or does this I/O
> indeed fail?
> 
> Anyways, if we change the buffer flags simply to control the behavior
> of
> iodone_callback_error(), we've basically decided to treat an
> otherwise
> async I/O as sync with the assumption that the submitter is going to
> handle the I/O error. As mentioned above, we don't really have that
> information here so that is kind of a dangerous assumption. At the
> very
> least, it bypasses all of the configurable error handling
> infrastructure
> that is specifically in place for async submitted buffers.
> 
> I'm wondering if you're hitting the flush locked inode retry problem
> that Carlos has been working on. The problem is basically that if the
> I/O fails on a buffer that backs a flush locked inode, we never retry
> the buffer and xfsaild basically spins on pushing the inode. What
> happens if you enable xfs tracepoints while the fs is hung?

Sorry for some reason my reply to linux-xfs got eaten.  It made it to
Darrick but vger is being weird (sometimes it fails to get an email,
sometimes it works fine).

So yes my patch doesn't actually work, I just didn't notice that the
problem had happened until after I sent the patch.

I'm doing this on a 4.6 kernel; I haven't actually tried it on an
upstream kernel, but my original assumption was that it was still
possible to hit upstream.  Basically on 4.6 we _always_ resubmit
the IO once if it fails, and then just drop the buffer if we've re-
submitted once.  We never do the xfs_buf_do_callbacks() call, which
seems wrong since as you say basically everything will set XBF_ASYNC.
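
For reference, this is roughly what the v4.6 async error path in
xfs_buf_iodone_callbacks() looks like (simplified, not verbatim); it is
the same hunk that gets modified later in this thread:

	if (bp->b_flags & XBF_ASYNC) {
		ASSERT(bp->b_iodone != NULL);

		trace_xfs_buf_item_iodone_async(bp, _RET_IP_);

		xfs_buf_ioerror(bp, 0); /* errno of 0 unsets the flag */

		if (!(bp->b_flags & (XBF_STALE|XBF_WRITE_FAIL))) {
			/* first failure: resubmit once */
			bp->b_flags |= XBF_WRITE | XBF_ASYNC |
				       XBF_DONE | XBF_WRITE_FAIL;
			xfs_buf_submit(bp);
		} else {
			/* second failure: just drop it, no callbacks */
			xfs_buf_relse(bp);
		}

		return;
	}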

What I've done to fix it for 4.6, in the second case where we just
drop the buffer, is to go ahead and do the xfs_buf_do_callbacks() call
and then clear the fspriv and iodone callback on the buffer, and that
actually made the problem go away.  Looking at the upstream code, if
it's a permanent IO error we still do the callbacks, so I think the
problem I've hit has already been fixed upstream.  I'll verify next
week just to be sure, since I can usually hit the problem in less than
5 minutes of my reproducer.  Thanks,

Josef
Brian Foster March 3, 2017, 10:20 p.m. UTC | #3
On Fri, Mar 03, 2017 at 04:34:16PM -0500, Josef Bacik wrote:
> On Fri, 2017-03-03 at 15:49 -0500, Brian Foster wrote:
> > On Fri, Mar 03, 2017 at 09:46:44AM -0500, Josef Bacik wrote:
> > > 
> > > While testing nbd disconnects I kept hitting the following hang
> > > 
> > > Call Trace:
> > >  schedule+0x35/0x80
> > >  xfs_ail_push_all_sync+0xa3/0xe0
> > >  ? prepare_to_wait_event+0x100/0x100
> > >  xfs_unmountfs+0x57/0x190
> > >  xfs_fs_put_super+0x32/0x90
> > >  generic_shutdown_super+0x6f/0xf0
> > >  kill_block_super+0x27/0x70
> > >  deactivate_locked_super+0x3e/0x70
> > >  deactivate_super+0x46/0x60
> > >  cleanup_mnt+0x3f/0x80
> > >  __cleanup_mnt+0x12/0x20
> > >  task_work_run+0x86/0xb0
> > >  exit_to_usermode_loop+0x6d/0x96
> > >  do_syscall_64+0x8b/0xa0
> > >  entry_SYSCALL64_slow_path+0x25/0x25
> > > 
> > > After some digging around I found that there was a log item on the
> > > ail
> > > with a callback of xfs_iflush_done.  A printk confirmed that at the
> > > call
> > > to xfs_buf_attach_iodone in xfs_iflush_int had XBF_ASYNC already
> > > set,
> > > which means on error we do not call xfs_buf_do_callbacks, which
> > > leaves
> > > the log item on the ail list which causes us to hang on unmount.  I
> > > assume the block has XBF_ASYNC set because we did a readahead on
> > > it, so
> > > it doesn't really need to have XBF_ASYNC set at this point as we do
> > > actually care about what happens to the buffer once IO is complete.
> > > With this patch my hang no longer happens.
> > > 
> > > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > > ---
> > >  fs/xfs/xfs_buf_item.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > > index 2975cb2..24fcb67 100644
> > > --- a/fs/xfs/xfs_buf_item.c
> > > +++ b/fs/xfs/xfs_buf_item.c
> > > @@ -1016,6 +1016,11 @@ xfs_buf_attach_iodone(
> > >  
> > >  	ASSERT(bp->b_iodone == NULL ||
> > >  	       bp->b_iodone == xfs_buf_iodone_callbacks);
> > > +	/*
> > > +	 * Somebody now cares about the fate of this buffer, clear
> > > XBF_ASYNC so
> > > +	 * that the iodone callback actually gets called.
> > > +	 */
> > > +	bp->b_flags &= ~XBF_ASYNC;
> > XBF_ASYNC generally describes the type of I/O being submitted for a
> > particular buffer (blocking or non), so it being set or not is kind
> > of
> > outside the scope of xfs_iflush(). It really depends on how the
> > caller
> > ultimately wants to write the buffer. For example, if we get here via
> > xfsaild_push()->xfs_inode_item_push(), the latter will do
> > xfs_buf_delwri_queue() and the former will eventually call
> > xfs_buf_delwri_submit_nowait(), which submits async I/O and so will
> > re-add XBF_ASYNC.
> > 
> > It is interesting if this prevents your problem. I can't quite make
> > out
> > why that would be. The other xfs_iflush() caller is reclaim, which
> > calls
> > xfs_bwrite() which itself clears XBF_ASYNC before submission.
> > 
> > What I'm guessing you're referring to above wrt to not calling
> > xfs_buf_do_callbacks() is the async check in
> > xfs_buf_iodone_callback_error(), which is only relevant on I/O error.
> > If
> > the buffer I/O succeeds, we should always invoke the iodone callbacks
> > regardless of async state. Do you observe otherwise or does this I/O
> > indeed fail?
> > 
> > Anyways, if we change the buffer flags simply to control the behavior
> > of
> > iodone_callback_error(), we've basically decided to treat an
> > otherwise
> > async I/O as sync with the assumption that the submitter is going to
> > handle the I/O error. As mentioned above, we don't really have that
> > information here so that is kind of a dangerous assumption. At the
> > very
> > least, it bypasses all of the configurable error handling
> > infrastructure
> > that is specifically in place for async submitted buffers.
> > 
> > I'm wondering if you're hitting the flush locked inode retry problem
> > that Carlos has been working on. The problem is basically that if the
> > I/O fails on a buffer that backs a flush locked inode, we never retry
> > the buffer and xfsaild basically spins on pushing the inode. What
> > happens if you enable xfs tracepoints while the fs is hung?
> 
> Sorry for some reason my reply to linux-xfs got eaten.  It made it to
> Darrick but vger is being weird (sometimes it fails to get an email,
> sometimes it works fine).
> 
> So yes my patch doesn't actually work, I just didn't notice that the
> problem had happened until after I sent the page.
> 
> I'm doing this on a 4.6 kernel, I haven't actually tried it on a
> upstream kernel but my original assumption was it was still available
> to hit on an upstream kernel.  Basically on 4.6 we _always_ resubmit
> the IO once if it fails, and then just drop the buffer if we've re-
> submitted once.  We never do the xfs_buf_do_callbacks() call, which
> seems wrong since as you say basically everything will set XBF_ASYNC.
> 

The XBF_WRITE_FAIL resubmission is an internal retry thing intended to
deal with transient errors. Assuming the internal retry fails, we do
release the buffer but note that the higher level item is still tracked
on the AIL (which basically just tracks that the inode is in the
physical log and we're waiting on the metadata writeback to complete to
free up space in the log).

IOWs, while the I/O path doesn't retry and releases the buffer, the
intent is that the next xfsaild_push() cycle will retry the I/O (this is
the part that is actually broken for flush locked inodes).
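
(Roughly how that retry works: xfsaild_push() walks the AIL, calls each
item's ->iop_push() to collect buffers on a delwri list, then submits
that list. A heavily simplified sketch, not verbatim:

	/* one AIL push cycle */
	lock_result = lip->li_ops->iop_push(lip, &ailp->xa_buf_list);
	...
	/* at the end of the cycle, write out everything that got queued */
	xfs_buf_delwri_submit_nowait(&ailp->xa_buf_list);

So as long as iop_push() can re-queue the buffer, the write is retried;
the flush locked inode case is where it can't.)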

> What I've done to fix it for 4.6 was in the second case where we just
> drop the buffer was to go ahead and do the xfs_buf_do_callbacks() call
> and then clear the fspriv and iodone callback on the buffer, and that
> actually made the problem go away.  Looking at the upstream code if
> it's a permanent IO error we still do the callbacks, so I think the
> problem I've hit has already been fixed upstream.  I'll verify next
> week just to be sure, since I can usually hit the problem in less than
> 5 minutes of my reproducer.  Thanks,
> 

Ok, just be aware that while that might allow your hung unmount case to
progress, what you're doing in the general/runtime case is treating a
buffer that fails the I/O twice as successfully completed. E.g., see
that xfs_iflush_done() invokes xfs_trans_ail_delete_bulk() which removes
the inode item from the AIL as if it were written back to its final
location on disk.

It might not be straightforward to reproduce because in practice I/O
errors won't occur, may mean the device is broken and the fs is going to
shutdown anyways, or some other dirty inode backed by the same buffer
might flush it, but otherwise this sounds like a path to corruption. For
example, suppose we modify an inode, commit the transaction etc. such
that the inode makes it to the log. At this point, the buffer is now
available for writeback. Say we attempt the buffer writeback and it
fails both initially and for the internal retry. IIUC, the change you've
made will basically release the inode from the AIL at this point as if
it succeeded, but clearly the updated inode has not made it to disk. If
the device otherwise goes back to normal and no more I/O errors occur,
we've basically just thrown away the change to the inode and corrupted
the fs.

Brian

> Josef
Josef Bacik March 4, 2017, 1:26 a.m. UTC | #4
> On Mar 3, 2017, at 5:20 PM, Brian Foster <bfoster@redhat.com> wrote:
> 
>> On Fri, Mar 03, 2017 at 04:34:16PM -0500, Josef Bacik wrote:
>>> On Fri, 2017-03-03 at 15:49 -0500, Brian Foster wrote:
>>>> On Fri, Mar 03, 2017 at 09:46:44AM -0500, Josef Bacik wrote:
>>>> 
>>>> While testing nbd disconnects I kept hitting the following hang
>>>> 
>>>> Call Trace:
>>>>  schedule+0x35/0x80
>>>>  xfs_ail_push_all_sync+0xa3/0xe0
>>>>  ? prepare_to_wait_event+0x100/0x100
>>>>  xfs_unmountfs+0x57/0x190
>>>>  xfs_fs_put_super+0x32/0x90
>>>>  generic_shutdown_super+0x6f/0xf0
>>>>  kill_block_super+0x27/0x70
>>>>  deactivate_locked_super+0x3e/0x70
>>>>  deactivate_super+0x46/0x60
>>>>  cleanup_mnt+0x3f/0x80
>>>>  __cleanup_mnt+0x12/0x20
>>>>  task_work_run+0x86/0xb0
>>>>  exit_to_usermode_loop+0x6d/0x96
>>>>  do_syscall_64+0x8b/0xa0
>>>>  entry_SYSCALL64_slow_path+0x25/0x25
>>>> 
>>>> After some digging around I found that there was a log item on the
>>>> ail
>>>> with a callback of xfs_iflush_done.  A printk confirmed that at the
>>>> call
>>>> to xfs_buf_attach_iodone in xfs_iflush_int had XBF_ASYNC already
>>>> set,
>>>> which means on error we do not call xfs_buf_do_callbacks, which
>>>> leaves
>>>> the log item on the ail list which causes us to hang on unmount.  I
>>>> assume the block has XBF_ASYNC set because we did a readahead on
>>>> it, so
>>>> it doesn't really need to have XBF_ASYNC set at this point as we do
>>>> actually care about what happens to the buffer once IO is complete.
>>>> With this patch my hang no longer happens.
>>>> 
>>>> Signed-off-by: Josef Bacik <jbacik@fb.com>
>>>> ---
>>>>  fs/xfs/xfs_buf_item.c | 5 +++++
>>>>  1 file changed, 5 insertions(+)
>>>> 
>>>> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
>>>> index 2975cb2..24fcb67 100644
>>>> --- a/fs/xfs/xfs_buf_item.c
>>>> +++ b/fs/xfs/xfs_buf_item.c
>>>> @@ -1016,6 +1016,11 @@ xfs_buf_attach_iodone(
>>>>  
>>>>      ASSERT(bp->b_iodone == NULL ||
>>>>             bp->b_iodone == xfs_buf_iodone_callbacks);
>>>> +    /*
>>>> +     * Somebody now cares about the fate of this buffer, clear
>>>> XBF_ASYNC so
>>>> +     * that the iodone callback actually gets called.
>>>> +     */
>>>> +    bp->b_flags &= ~XBF_ASYNC;
>>> XBF_ASYNC generally describes the type of I/O being submitted for a
>>> particular buffer (blocking or non), so it being set or not is kind
>>> of
>>> outside the scope of xfs_iflush(). It really depends on how the
>>> caller
>>> ultimately wants to write the buffer. For example, if we get here via
>>> xfsaild_push()->xfs_inode_item_push(), the latter will do
>>> xfs_buf_delwri_queue() and the former will eventually call
>>> xfs_buf_delwri_submit_nowait(), which submits async I/O and so will
>>> re-add XBF_ASYNC.
>>> 
>>> It is interesting if this prevents your problem. I can't quite make
>>> out
>>> why that would be. The other xfs_iflush() caller is reclaim, which
>>> calls
>>> xfs_bwrite() which itself clears XBF_ASYNC before submission.
>>> 
>>> What I'm guessing you're referring to above wrt to not calling
>>> xfs_buf_do_callbacks() is the async check in
>>> xfs_buf_iodone_callback_error(), which is only relevant on I/O error.
>>> If
>>> the buffer I/O succeeds, we should always invoke the iodone callbacks
>>> regardless of async state. Do you observe otherwise or does this I/O
>>> indeed fail?
>>> 
>>> Anyways, if we change the buffer flags simply to control the behavior
>>> of
>>> iodone_callback_error(), we've basically decided to treat an
>>> otherwise
>>> async I/O as sync with the assumption that the submitter is going to
>>> handle the I/O error. As mentioned above, we don't really have that
>>> information here so that is kind of a dangerous assumption. At the
>>> very
>>> least, it bypasses all of the configurable error handling
>>> infrastructure
>>> that is specifically in place for async submitted buffers.
>>> 
>>> I'm wondering if you're hitting the flush locked inode retry problem
>>> that Carlos has been working on. The problem is basically that if the
>>> I/O fails on a buffer that backs a flush locked inode, we never retry
>>> the buffer and xfsaild basically spins on pushing the inode. What
>>> happens if you enable xfs tracepoints while the fs is hung?
>> 
>> Sorry for some reason my reply to linux-xfs got eaten.  It made it to
>> Darrick but vger is being weird (sometimes it fails to get an email,
>> sometimes it works fine).
>> 
>> So yes my patch doesn't actually work, I just didn't notice that the
>> problem had happened until after I sent the page.
>> 
>> I'm doing this on a 4.6 kernel, I haven't actually tried it on a
>> upstream kernel but my original assumption was it was still available
>> to hit on an upstream kernel.  Basically on 4.6 we _always_ resubmit
>> the IO once if it fails, and then just drop the buffer if we've re-
>> submitted once.  We never do the xfs_buf_do_callbacks() call, which
>> seems wrong since as you say basically everything will set XBF_ASYNC.
>> 
> 
> The XBF_WRITE_FAIL resubmission is an internal retry thing intended to
> deal with transient errors. Assuming the internal retry fails, we do
> release the buffer but note that the higher level item is still tracked
> on the AIL (which basically just tracks that the inode is in the
> physical log and we're waiting on the metadata writeback to complete to
> free up space in the log).
> 
> IOWs, while the I/O path doesn't retry and releases the buffer, the
> intent is that the next xfsaild_push() cycle will retry the I/O (this is
> the part that is actually broken for flush locked inodes).

Yeah this sounds like what is happening, as I'll see xfs errors every N seconds while it's hanging, so it's retrying the IO; it's just not failing out gracefully.  What would be the course of action here?  I can apply things and run them here pretty quickly.

> 
>> What I've done to fix it for 4.6 was in the second case where we just
>> drop the buffer was to go ahead and do the xfs_buf_do_callbacks() call
>> and then clear the fspriv and iodone callback on the buffer, and that
>> actually made the problem go away.  Looking at the upstream code if
>> it's a permanent IO error we still do the callbacks, so I think the
>> problem I've hit has already been fixed upstream.  I'll verify next
>> week just to be sure, since I can usually hit the problem in less than
>> 5 minutes of my reproducer.  Thanks,
>> 
> 
> Ok, just be aware that while that might allow your hung unmount case to
> progress, what you're doing in the general/runtime case is treating a
> buffer that fails the I/O twice as successfully completed. E.g., see
> that xfs_iflush_done() invokes xfs_trans_ail_delete_bulk() which removes
> the inode item from the AIL as if it were written back to its final
> location on disk.
> 
> It might not be straightforward to reproduce because in practice I/O
> errors won't occur, may mean the device is broken and the fs is going to
> shutdown anyways, or some other dirty inode backed by the same buffer
> might flush it, but otherwise this sounds like a path to corruption. For
> example, suppose we modify an inode, commit the transaction etc. such
> that the inode makes it to the log. At this point, the buffer is now
> available for writeback. Say we attempt the buffer writeback and it
> fails both initially and for the internal retry. IIUC, the change you've
> made will basically release the inode from the AIL at this point as if
> it succeeded, but clearly the updated inode has not made it to disk. If
> the device otherwise goes back to normal and no more I/O errors occur,
> we've basically just thrown away the change to the inode and corrupted
> the fs.

Well thankfully this is when the device goes completely away, so I'm not worried about that, but yeah, that's not an excellent general solution.  It looked like it was tripping the shutdown code in the iflush path, though this isn't my area.  Thanks,

Josef
Brian Foster March 4, 2017, 1:58 p.m. UTC | #5
cc Carlos

On Sat, Mar 04, 2017 at 01:26:50AM +0000, Josef Bacik wrote:
> 
> > On Mar 3, 2017, at 5:20 PM, Brian Foster <bfoster@redhat.com> wrote:
> > 
> >> On Fri, Mar 03, 2017 at 04:34:16PM -0500, Josef Bacik wrote:
> >>> On Fri, 2017-03-03 at 15:49 -0500, Brian Foster wrote:
> >>>> On Fri, Mar 03, 2017 at 09:46:44AM -0500, Josef Bacik wrote:
> >>>> 
> >>>> While testing nbd disconnects I kept hitting the following hang
> >>>> 
> >>>> Call Trace:
> >>>>  schedule+0x35/0x80
> >>>>  xfs_ail_push_all_sync+0xa3/0xe0
> >>>>  ? prepare_to_wait_event+0x100/0x100
> >>>>  xfs_unmountfs+0x57/0x190
> >>>>  xfs_fs_put_super+0x32/0x90
> >>>>  generic_shutdown_super+0x6f/0xf0
> >>>>  kill_block_super+0x27/0x70
> >>>>  deactivate_locked_super+0x3e/0x70
> >>>>  deactivate_super+0x46/0x60
> >>>>  cleanup_mnt+0x3f/0x80
> >>>>  __cleanup_mnt+0x12/0x20
> >>>>  task_work_run+0x86/0xb0
> >>>>  exit_to_usermode_loop+0x6d/0x96
> >>>>  do_syscall_64+0x8b/0xa0
> >>>>  entry_SYSCALL64_slow_path+0x25/0x25
> >>>> 
> >>>> After some digging around I found that there was a log item on the
> >>>> ail
> >>>> with a callback of xfs_iflush_done.  A printk confirmed that at the
> >>>> call
> >>>> to xfs_buf_attach_iodone in xfs_iflush_int had XBF_ASYNC already
> >>>> set,
> >>>> which means on error we do not call xfs_buf_do_callbacks, which
> >>>> leaves
> >>>> the log item on the ail list which causes us to hang on unmount.  I
> >>>> assume the block has XBF_ASYNC set because we did a readahead on
> >>>> it, so
> >>>> it doesn't really need to have XBF_ASYNC set at this point as we do
> >>>> actually care about what happens to the buffer once IO is complete.
> >>>> With this patch my hang no longer happens.
> >>>> 
> >>>> Signed-off-by: Josef Bacik <jbacik@fb.com>
> >>>> ---
> >>>>  fs/xfs/xfs_buf_item.c | 5 +++++
> >>>>  1 file changed, 5 insertions(+)
> >>>> 
> >>>> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> >>>> index 2975cb2..24fcb67 100644
> >>>> --- a/fs/xfs/xfs_buf_item.c
> >>>> +++ b/fs/xfs/xfs_buf_item.c
> >>>> @@ -1016,6 +1016,11 @@ xfs_buf_attach_iodone(
> >>>>  
> >>>>      ASSERT(bp->b_iodone == NULL ||
> >>>>             bp->b_iodone == xfs_buf_iodone_callbacks);
> >>>> +    /*
> >>>> +     * Somebody now cares about the fate of this buffer, clear
> >>>> XBF_ASYNC so
> >>>> +     * that the iodone callback actually gets called.
> >>>> +     */
> >>>> +    bp->b_flags &= ~XBF_ASYNC;
> >>> XBF_ASYNC generally describes the type of I/O being submitted for a
> >>> particular buffer (blocking or non), so it being set or not is kind
> >>> of
> >>> outside the scope of xfs_iflush(). It really depends on how the
> >>> caller
> >>> ultimately wants to write the buffer. For example, if we get here via
> >>> xfsaild_push()->xfs_inode_item_push(), the latter will do
> >>> xfs_buf_delwri_queue() and the former will eventually call
> >>> xfs_buf_delwri_submit_nowait(), which submits async I/O and so will
> >>> re-add XBF_ASYNC.
> >>> 
> >>> It is interesting if this prevents your problem. I can't quite make
> >>> out
> >>> why that would be. The other xfs_iflush() caller is reclaim, which
> >>> calls
> >>> xfs_bwrite() which itself clears XBF_ASYNC before submission.
> >>> 
> >>> What I'm guessing you're referring to above wrt to not calling
> >>> xfs_buf_do_callbacks() is the async check in
> >>> xfs_buf_iodone_callback_error(), which is only relevant on I/O error.
> >>> If
> >>> the buffer I/O succeeds, we should always invoke the iodone callbacks
> >>> regardless of async state. Do you observe otherwise or does this I/O
> >>> indeed fail?
> >>> 
> >>> Anyways, if we change the buffer flags simply to control the behavior
> >>> of
> >>> iodone_callback_error(), we've basically decided to treat an
> >>> otherwise
> >>> async I/O as sync with the assumption that the submitter is going to
> >>> handle the I/O error. As mentioned above, we don't really have that
> >>> information here so that is kind of a dangerous assumption. At the
> >>> very
> >>> least, it bypasses all of the configurable error handling
> >>> infrastructure
> >>> that is specifically in place for async submitted buffers.
> >>> 
> >>> I'm wondering if you're hitting the flush locked inode retry problem
> >>> that Carlos has been working on. The problem is basically that if the
> >>> I/O fails on a buffer that backs a flush locked inode, we never retry
> >>> the buffer and xfsaild basically spins on pushing the inode. What
> >>> happens if you enable xfs tracepoints while the fs is hung?
> >> 
> >> Sorry for some reason my reply to linux-xfs got eaten.  It made it to
> >> Darrick but vger is being weird (sometimes it fails to get an email,
> >> sometimes it works fine).
> >> 
> >> So yes my patch doesn't actually work, I just didn't notice that the
> >> problem had happened until after I sent the page.
> >> 
> >> I'm doing this on a 4.6 kernel, I haven't actually tried it on a
> >> upstream kernel but my original assumption was it was still available
> >> to hit on an upstream kernel.  Basically on 4.6 we _always_ resubmit
> >> the IO once if it fails, and then just drop the buffer if we've re-
> >> submitted once.  We never do the xfs_buf_do_callbacks() call, which
> >> seems wrong since as you say basically everything will set XBF_ASYNC.
> >> 
> > 
> > The XBF_WRITE_FAIL resubmission is an internal retry thing intended to
> > deal with transient errors. Assuming the internal retry fails, we do
> > release the buffer but note that the higher level item is still tracked
> > on the AIL (which basically just tracks that the inode is in the
> > physical log and we're waiting on the metadata writeback to complete to
> > free up space in the log).
> > 
> > IOWs, while the I/O path doesn't retry and releases the buffer, the
> > intent is that the next xfsaild_push() cycle will retry the I/O (this is
> > the part that is actually broken for flush locked inodes).
> 
> Yeah this sounds like what is happening, as I'll see xfs errors every N seconds while its hanging, so it's retrying the IO, it's just not failing out gracefully. What would be the course of action here?  I can apply things and run them here pretty quickly.
> 

If it is actually the flush lock issue, Carlos was working on a proper
fix for that. I'm not sure where that stands currently, however..
Carlos?

If there isn't anything available to test quite yet, there was also a
poc patch[1], intended more to verify what the problem was. I believe that one is
at least safe, but still probably not production worthy as it may have
performance ramifications. It should at least tell you whether that is
the problem or something else is going on here.

Oh, commit e6b3bb789 ("xfs: add "fail at unmount" error handling
configuration") may also be interesting here if the fs doesn't shutdown
on unmount. That one looks like it came after v4.6 as well.
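
(That commit adds a /sys/fs/xfs/<dev>/error/fail_at_unmount knob; when
set, async write errors are treated as permanent once the unmount has
started, so the callbacks run and the AIL can drain instead of retrying
forever. The check in xfs_buf_iodone_callback_error() is roughly this,
simplified and from memory:

	/* At unmount we may treat errors differently */
	if ((mp->m_flags & XFS_MOUNT_UNMOUNTING) && mp->m_fail_unmount)
		goto permanent_error;
)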

Brian

[1] http://www.spinics.net/lists/linux-xfs/msg01018.html

> > 
> >> What I've done to fix it for 4.6 was in the second case where we just
> >> drop the buffer was to go ahead and do the xfs_buf_do_callbacks() call
> >> and then clear the fspriv and iodone callback on the buffer, and that
> >> actually made the problem go away.  Looking at the upstream code if
> >> it's a permanent IO error we still do the callbacks, so I think the
> >> problem I've hit has already been fixed upstream.  I'll verify next
> >> week just to be sure, since I can usually hit the problem in less than
> >> 5 minutes of my reproducer.  Thanks,
> >> 
> > 
> > Ok, just be aware that while that might allow your hung unmount case to
> > progress, what you're doing in the general/runtime case is treating a
> > buffer that fails the I/O twice as successfully completed. E.g., see
> > that xfs_iflush_done() invokes xfs_trans_ail_delete_bulk() which removes
> > the inode item from the AIL as if it were written back to its final
> > location on disk.
> > 
> > It might not be straightforward to reproduce because in practice I/O
> > errors won't occur, may mean the device is broken and the fs is going to
> > shutdown anyways, or some other dirty inode backed by the same buffer
> > might flush it, but otherwise this sounds like a path to corruption. For
> > example, suppose we modify an inode, commit the transaction etc. such
> > that the inode makes it to the log. At this point, the buffer is now
> > available for writeback. Say we attempt the buffer writeback and it
> > fails both initially and for the internal retry. IIUC, the change you've
> > made will basically release the inode from the AIL at this point as if
> > it succeeded, but clearly the updated inode has not made it to disk. If
> > the device otherwise goes back to normal and no more I/O errors occur,
> > we've basically just thrown away the change to the inode and corrupted
> > the fs.
> 
> Well thankfully this is when the device goes completely away so I'm not worried about that, but yeah that's not an excellent general solution.  Though it looked like it was tripping the shutdown code in the iflush code, but this isn't my area.  Thanks,
> 
> Josef
Josef Bacik March 6, 2017, 7 p.m. UTC | #6
On Sat, 2017-03-04 at 08:58 -0500, Brian Foster wrote:
> cc Carlos
> 
> On Sat, Mar 04, 2017 at 01:26:50AM +0000, Josef Bacik wrote:
> > 
> > 
> > > 
> > > On Mar 3, 2017, at 5:20 PM, Brian Foster <bfoster@redhat.com>
> > > wrote:
> > > 
> > > > 
> > > > On Fri, Mar 03, 2017 at 04:34:16PM -0500, Josef Bacik wrote:
> > > > > 
> > > > > On Fri, 2017-03-03 at 15:49 -0500, Brian Foster wrote:
> > > > > > 
> > > > > > On Fri, Mar 03, 2017 at 09:46:44AM -0500, Josef Bacik
> > > > > > wrote:
> > > > > > 
> > > > > > While testing nbd disconnects I kept hitting the following
> > > > > > hang
> > > > > > 
> > > > > > Call Trace:
> > > > > >  schedule+0x35/0x80
> > > > > >  xfs_ail_push_all_sync+0xa3/0xe0
> > > > > >  ? prepare_to_wait_event+0x100/0x100
> > > > > >  xfs_unmountfs+0x57/0x190
> > > > > >  xfs_fs_put_super+0x32/0x90
> > > > > >  generic_shutdown_super+0x6f/0xf0
> > > > > >  kill_block_super+0x27/0x70
> > > > > >  deactivate_locked_super+0x3e/0x70
> > > > > >  deactivate_super+0x46/0x60
> > > > > >  cleanup_mnt+0x3f/0x80
> > > > > >  __cleanup_mnt+0x12/0x20
> > > > > >  task_work_run+0x86/0xb0
> > > > > >  exit_to_usermode_loop+0x6d/0x96
> > > > > >  do_syscall_64+0x8b/0xa0
> > > > > >  entry_SYSCALL64_slow_path+0x25/0x25
> > > > > > 
> > > > > > After some digging around I found that there was a log item
> > > > > > on the
> > > > > > ail
> > > > > > with a callback of xfs_iflush_done.  A printk confirmed
> > > > > > that at the
> > > > > > call
> > > > > > to xfs_buf_attach_iodone in xfs_iflush_int had XBF_ASYNC
> > > > > > already
> > > > > > set,
> > > > > > which means on error we do not call xfs_buf_do_callbacks,
> > > > > > which
> > > > > > leaves
> > > > > > the log item on the ail list which causes us to hang on
> > > > > > unmount.  I
> > > > > > assume the block has XBF_ASYNC set because we did a
> > > > > > readahead on
> > > > > > it, so
> > > > > > it doesn't really need to have XBF_ASYNC set at this point
> > > > > > as we do
> > > > > > actually care about what happens to the buffer once IO is
> > > > > > complete.
> > > > > > With this patch my hang no longer happens.
> > > > > > 
> > > > > > Signed-off-by: Josef Bacik <jbacik@fb.com>
> > > > > > ---
> > > > > >  fs/xfs/xfs_buf_item.c | 5 +++++
> > > > > >  1 file changed, 5 insertions(+)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> > > > > > index 2975cb2..24fcb67 100644
> > > > > > --- a/fs/xfs/xfs_buf_item.c
> > > > > > +++ b/fs/xfs/xfs_buf_item.c
> > > > > > @@ -1016,6 +1016,11 @@ xfs_buf_attach_iodone(
> > > > > >  
> > > > > >      ASSERT(bp->b_iodone == NULL ||
> > > > > >             bp->b_iodone == xfs_buf_iodone_callbacks);
> > > > > > +    /*
> > > > > > +     * Somebody now cares about the fate of this buffer,
> > > > > > clear
> > > > > > XBF_ASYNC so
> > > > > > +     * that the iodone callback actually gets called.
> > > > > > +     */
> > > > > > +    bp->b_flags &= ~XBF_ASYNC;
> > > > > XBF_ASYNC generally describes the type of I/O being submitted
> > > > > for a
> > > > > particular buffer (blocking or non), so it being set or not
> > > > > is kind
> > > > > of
> > > > > outside the scope of xfs_iflush(). It really depends on how
> > > > > the
> > > > > caller
> > > > > ultimately wants to write the buffer. For example, if we get
> > > > > here via
> > > > > xfsaild_push()->xfs_inode_item_push(), the latter will do
> > > > > xfs_buf_delwri_queue() and the former will eventually call
> > > > > xfs_buf_delwri_submit_nowait(), which submits async I/O and
> > > > > so will
> > > > > re-add XBF_ASYNC.
> > > > > 
> > > > > It is interesting if this prevents your problem. I can't
> > > > > quite make
> > > > > out
> > > > > why that would be. The other xfs_iflush() caller is reclaim,
> > > > > which
> > > > > calls
> > > > > xfs_bwrite() which itself clears XBF_ASYNC before submission.
> > > > > 
> > > > > What I'm guessing you're referring to above wrt to not
> > > > > calling
> > > > > xfs_buf_do_callbacks() is the async check in
> > > > > xfs_buf_iodone_callback_error(), which is only relevant on
> > > > > I/O error.
> > > > > If
> > > > > the buffer I/O succeeds, we should always invoke the iodone
> > > > > callbacks
> > > > > regardless of async state. Do you observe otherwise or does
> > > > > this I/O
> > > > > indeed fail?
> > > > > 
> > > > > Anyways, if we change the buffer flags simply to control the
> > > > > behavior
> > > > > of
> > > > > iodone_callback_error(), we've basically decided to treat an
> > > > > otherwise
> > > > > async I/O as sync with the assumption that the submitter is
> > > > > going to
> > > > > handle the I/O error. As mentioned above, we don't really
> > > > > have that
> > > > > information here so that is kind of a dangerous assumption.
> > > > > At the
> > > > > very
> > > > > least, it bypasses all of the configurable error handling
> > > > > infrastructure
> > > > > that is specifically in place for async submitted buffers.
> > > > > 
> > > > > I'm wondering if you're hitting the flush locked inode retry
> > > > > problem
> > > > > that Carlos has been working on. The problem is basically
> > > > > that if the
> > > > > I/O fails on a buffer that backs a flush locked inode, we
> > > > > never retry
> > > > > the buffer and xfsaild basically spins on pushing the inode.
> > > > > What
> > > > > happens if you enable xfs tracepoints while the fs is hung?
> > > > Sorry for some reason my reply to linux-xfs got eaten.  It made
> > > > it to
> > > > Darrick but vger is being weird (sometimes it fails to get an
> > > > email,
> > > > sometimes it works fine).
> > > > 
> > > > So yes my patch doesn't actually work, I just didn't notice
> > > > that the
> > > > problem had happened until after I sent the page.
> > > > 
> > > > I'm doing this on a 4.6 kernel, I haven't actually tried it on
> > > > a
> > > > upstream kernel but my original assumption was it was still
> > > > available
> > > > to hit on an upstream kernel.  Basically on 4.6 we _always_
> > > > resubmit
> > > > the IO once if it fails, and then just drop the buffer if we've
> > > > re-
> > > > submitted once.  We never do the xfs_buf_do_callbacks() call,
> > > > which
> > > > seems wrong since as you say basically everything will set
> > > > XBF_ASYNC.
> > > > 
> > > The XBF_WRITE_FAIL resubmission is an internal retry thing
> > > intended to
> > > deal with transient errors. Assuming the internal retry fails, we
> > > do
> > > release the buffer but note that the higher level item is still
> > > tracked
> > > on the AIL (which basically just tracks that the inode is in the
> > > physical log and we're waiting on the metadata writeback to
> > > complete to
> > > free up space in the log).
> > > 
> > > IOWs, while the I/O path doesn't retry and releases the buffer,
> > > the
> > > intent is that the next xfsaild_push() cycle will retry the I/O
> > > (this is
> > > the part that is actually broken for flush locked inodes).
> > Yeah this sounds like what is happening, as I'll see xfs errors
> > every N seconds while its hanging, so it's retrying the IO, it's
> > just not failing out gracefully. What would be the course of action
> > here?  I can apply things and run them here pretty quickly.
> > 
> If it is actually the flush lock issue, Carlos was working on a
> proper
> fix for that. I'm not sure where that stands currently, however..
> Carlos?
> 
> If there isn't anything available to test quite yet, there was also a
> poc patch[1] more to verify what the problem was. I believe that one
> is
> at least safe, but still probably not production worthy as it may
> have
> performance ramifications. It should at least tell you whether that
> is
> the problem or something else is going on here.
> 
> Oh, commit e6b3bb789 ("xfs: add "fail at unmount" error handling
> configuration") may also be interesting here if the fs doesn't
> shutdown
> on unmount. That one looks like it came after v4.6 as well.

Ok I looked at that patch and the email that Dave sent in response,
which sounds like what I'm hitting.  Looking at the 4.6 code we simply
drop the xfs_buf if our IO retry fails and assume that the upper layer
will take care of things if it cares.  Meanwhile the thing writing out
the AIL doesn't actually know if things went wrong
unless xfs_iflush_done() is run.

In the upstream code you guys do this

        if (!(bp->b_flags & (XBF_STALE | XBF_WRITE_FAIL)) ||
             bp->b_last_error != bp->b_error) {
                bp->b_flags |= (XBF_WRITE | XBF_DONE | XBF_WRITE_FAIL);
                bp->b_last_error = bp->b_error;
                if (cfg->retry_timeout != XFS_ERR_RETRY_FOREVER &&
                    !bp->b_first_retry_time)
                        bp->b_first_retry_time = jiffies;

                xfs_buf_ioerror(bp, 0);
                xfs_buf_submit(bp);
                return true;
        }

so you retry once, and the second time through we will return false if
we are unmounting, which means that xfs_buf_do_callbacks gets called.
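
(Concretely, the permanent-error tail of xfs_buf_iodone_callback_error()
ends up roughly as follows; a heavily simplified sketch from memory, not
verbatim:

	permanent_error:
		xfs_buf_stale(bp);
		bp->b_flags |= XBF_DONE;
		trace_xfs_buf_error_relse(bp, _RET_IP_);
		return false;

and that "return false" is what makes the caller fall through to
xfs_buf_do_callbacks() with the error state still on the buffer.)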

Now you say that xfs_buf_do_callbacks() being called when we errored
out is dangerous, but I don't understand why.  We have this code

        if (need_ail) {
                struct xfs_log_item *log_items[need_ail];
                int i = 0;
                spin_lock(&ailp->xa_lock);
                for (blip = lip; blip; blip = blip->li_bio_list) {
                        iip = INODE_ITEM(blip);
                        if (iip->ili_logged &&
                            blip->li_lsn == iip->ili_flush_lsn) {
                                log_items[i++] = blip;
                        }
                        ASSERT(i <= need_ail);
                }
                /* xfs_trans_ail_delete_bulk() drops the AIL lock. */
                xfs_trans_ail_delete_bulk(ailp, log_items, i,
                                          SHUTDOWN_CORRUPT_INCORE);
        }

and need_ail is set if we didn't actually end up on disk properly.  So
we delete the thing and mark the fs as fucked.  Am I mis-reading what
is going on here?  This is what I did to our 4.6 tree (in testing
still, not actually pushed to production)

        if (bp->b_flags & XBF_ASYNC) {
                ASSERT(bp->b_iodone != NULL);

                trace_xfs_buf_item_iodone_async(bp, _RET_IP_);

                xfs_buf_ioerror(bp, 0); /* errno of 0 unsets the flag */

                if (!(bp->b_flags & (XBF_STALE|XBF_WRITE_FAIL))) {
                        bp->b_flags |= XBF_WRITE | XBF_ASYNC |
                                       XBF_DONE | XBF_WRITE_FAIL;
                        xfs_buf_submit(bp);
                } else {
+                       xfs_buf_do_callbacks(bp);
+                       bp->b_fspriv = NULL;
+                       bp->b_iodone = NULL;
                        xfs_buf_relse(bp);
                }

                return;
        }

I've marked the things that I changed with '+'.  So we do the retry,
and on the second go around we do the callbacks which trips the logic
in xfs_iflush_done() to clean up the AIL and complain loudly.  I'm not
seeing the danger here, in fact it seems like the appropriate thing to
do as xfs_iflush_done() appears to do the correct thing if we call it
when something failed.  Thanks,

Josef
Dave Chinner March 7, 2017, 1:22 a.m. UTC | #7
On Mon, Mar 06, 2017 at 02:00:13PM -0500, Josef Bacik wrote:
> so you retry once, and the second time through we will return false if
> we are unmounting, which means that xfs_buf_do_callbacks gets called.
> 
> Now you say that xfs_buf_do_callbacks() being called when we errored
> out is dangerous, but I don't understand why.  We have this code
> 
>         if (need_ail) {
>                 struct xfs_log_item *log_items[need_ail];
>                 int i = 0;
>                 spin_lock(&ailp->xa_lock);
>                 for (blip = lip; blip; blip = blip->li_bio_list) {
>                         iip = INODE_ITEM(blip);
>                         if (iip->ili_logged &&
>                             blip->li_lsn == iip->ili_flush_lsn) {
>                                 log_items[i++] = blip;
>                         }
>                         ASSERT(i <= need_ail);
>                 }
>                 /* xfs_trans_ail_delete_bulk() drops the AIL lock. */
>                 xfs_trans_ail_delete_bulk(ailp, log_items, i,
>                                           SHUTDOWN_CORRUPT_INCORE);
>         }

[ Your cut-n-paste is horribly mangled in something non-ascii. ]

> and need_ail is set if we didn't actually end up on disk properly.  So
> we delete the thing and mark the fs as fucked.  Am I mis-reading what
> is going on here?

Yes. That won't shut down the filesystem on an inode write error. It
will simply remove the inode from the AIL, the error will be dropped
on the floor, and the filesystem will not be shut down even though
it's now in a corrupt state. xfs_trans_ail_delete_bulk() only shuts
down the filesystem if it doesn't find the item being removed from
the AIL in the AIL.

As I keep telling people, what needs to happen is this:

	1. buffer write error occurs
	2. retries exhausted, mark buffer as errored, run callbacks
	3. xfs_iflush_done() (and other iodone callbacks) need to
	   look at the buffer error state to determine what to do.
	4. if the buffer is errored and recovery is not possible,
	   the callback needs to shut down the filesystem.
	5. the callback then needs to call flush abort function for
	   item (e.g. xfs_iflush_abort()) rather than the normal
	   IO completion processing path.
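
(Purely as an illustration of the shape of steps 3-5 above, not existing
code: in xfs_iflush_done() this would look something like the following
sketch, with the structure made up for the example:

	if (bp->b_error) {
		/*
		 * Buffer writeback failed permanently: the inode never made
		 * it to disk, so don't pretend it did.  Shut down and abort
		 * the flush so the item comes off the AIL as failed rather
		 * than as completed.
		 */
		xfs_force_shutdown(ip->i_mount, SHUTDOWN_META_IO_ERROR);
		xfs_iflush_abort(ip, false);
		return;
	}

as opposed to running the normal completion path that treats the write
as having succeeded.)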

>  This is what I did to our 4.6 tree (in testing
> still, not actually pushed to production)
>
>         if (bp->b_flags & XBF_ASYNC) {
>                 ASSERT(bp->b_iodone != NULL);
> 
>                 trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
> 
>                 xfs_buf_ioerror(bp, 0); /* errno of 0 unsets the flag
> */
> 
>                 if (!(bp->b_flags & (XBF_STALE|XBF_WRITE_FAIL))) {
>                         bp->b_flags |= XBF_WRITE | XBF_ASYNC |
>                                        XBF_DONE | XBF_WRITE_FAIL;
>                         xfs_buf_submit(bp);
>                 } else {
> +                       xfs_buf_do_callbacks(bp);
> +                       bp->b_fspriv = NULL;
> +                       bp->b_iodone = NULL;
>                         xfs_buf_relse(bp);
>                 }
> 
>                 return;
>         }

It's likely you'll end up with silent on-disk corruption if you do
that, because it's just tossing the error state away.

Cheers,

Dave.
Josef Bacik March 7, 2017, 2:37 a.m. UTC | #8
> On Mar 6, 2017, at 8:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
>> On Mon, Mar 06, 2017 at 02:00:13PM -0500, Josef Bacik wrote:
>> so you retry once, and the second time through we will return false if
>> we are unmounting, which means that xfs_buf_do_callbacks gets called.
>> 
>> Now you say that xfs_buf_do_callbacks() being called when we errored
>> out is dangerous, but I don't understand why.  We have this code
>> 
>>         if (need_ail) {
>>                 struct xfs_log_item *log_items[need_ail];
>>                 int i = 0;
>>                 spin_lock(&ailp->xa_lock);
>>                 for (blip = lip; blip; blip = blip->li_bio_list) {
>>                         iip = INODE_ITEM(blip);
>>                         if (iip->ili_logged &&
>>                             blip->li_lsn == iip->ili_flush_lsn) {
>>                                 log_items[i++] = blip;
>>                         }
>>                         ASSERT(i <= need_ail);
>>                 }
>>                 /* xfs_trans_ail_delete_bulk() drops the AIL lock. */
>>                 xfs_trans_ail_delete_bulk(ailp, log_items, i,
>>                                           SHUTDOWN_CORRUPT_INCORE);
>>         }
> 
> [ Your cut-n-paste is horribly mangled in something non-ascii. ]
> 

Yeah I keep trying new mail clients and it's just not going well.

>> and need_ail is set if we didn't actually end up on disk properly.  So
>> we delete the thing and mark the fs as fucked.  Am I mis-reading what
>> is going on here?
> 
> Yes. That won't shut down the filesystem on an inode write error. It
> will simply remove the inode from the AIL, the error will do dropped
> on the floor, and the filesystem will not be shut down even though
> it's now in a corrupt state. xfs_trans_ail_delete_bulk() only shuts
> down the filesytem if it doesn't find the item being removed form
> the AIL in the AIL.
> 
> As I keep telling people, what needs to happen is this:
> 
>    1. buffer write error occurs
>    2. retries exhausted, mark buffer as errored, run callbacks
>    3. xfs_iflush_done() (and other iodone callbacks) need to
>       look at the buffer error state to determine what to do.
>    4. if the buffer is errored and recovery is not possible,
>       the callback needs to shut down the filesytem.
>    5. the callback then needs to call flush abort function for
>       item (e.g. xfs_iflush_abort()) rather than the normal
>       IO completion processing path.
> 
>>  This is what I did to our 4.6 tree (in testing
>> still, not actually pushed to production)
>> 
>>         if (bp->b_flags & XBF_ASYNC) {
>>                 ASSERT(bp->b_iodone != NULL);
>> 
>>                 trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
>> 
>>                 xfs_buf_ioerror(bp, 0); /* errno of 0 unsets the flag
>> */
>> 
>>                 if (!(bp->b_flags & (XBF_STALE|XBF_WRITE_FAIL))) {
>>                         bp->b_flags |= XBF_WRITE | XBF_ASYNC |
>>                                        XBF_DONE | XBF_WRITE_FAIL;
>>                         xfs_buf_submit(bp);
>>                 } else {
>> +                       xfs_buf_do_callbacks(bp);
>> +                       bp->b_fspriv = NULL;
>> +                       bp->b_iodone = NULL;
>>                         xfs_buf_relse(bp);
>>                 }
>> 
>>                 return;
>>         }
> 
> It's likely you'll end up with silent on disk corruption if you do
> that because it's just tossing the error state away.

Ok great this is what I needed, thanks Dave,

Josef
Josef Bacik March 7, 2017, 3:56 p.m. UTC | #9
On Tue, 2017-03-07 at 12:22 +1100, Dave Chinner wrote:
> On Mon, Mar 06, 2017 at 02:00:13PM -0500, Josef Bacik wrote:
> > 
> > so you retry once, and the second time through we will return false
> > if
> > we are unmounting, which means that xfs_buf_do_callbacks gets
> > called.
> > 
> > Now you say that xfs_buf_do_callbacks() being called when we
> > errored
> > out is dangerous, but I don't understand why.  We have this code
> > 
> >         if (need_ail) {
> >                 struct xfs_log_item *log_items[need_ail];
> >                 int i = 0;
> >                 spin_lock(&ailp->xa_lock);
> >                 for (blip = lip; blip; blip = blip->li_bio_list) {
> >                         iip = INODE_ITEM(blip);
> >                         if (iip->ili_logged &&
> >                             blip->li_lsn == iip->ili_flush_lsn) {
> >                                 log_items[i++] = blip;
> >                         }
> >                         ASSERT(i <= need_ail);
> >                 }
> >                 /* xfs_trans_ail_delete_bulk() drops the AIL lock.
> > */
> >                 xfs_trans_ail_delete_bulk(ailp, log_items, i,
> >                                           SHUTDOWN_CORRUPT_INCORE);
> >         }
> [ Your cut-n-paste is horribly mangled in something non-ascii. ]
> 
> > 
> > and need_ail is set if we didn't actually end up on disk properly.
> >  So
> > we delete the thing and mark the fs as fucked.  Am I mis-reading
> > what
> > is going on here?
> Yes. That won't shut down the filesystem on an inode write error. It
> will simply remove the inode from the AIL, the error will get dropped
> on the floor, and the filesystem will not be shut down even though
> it's now in a corrupt state. xfs_trans_ail_delete_bulk() only shuts
> down the filesystem if it doesn't find the item being removed from
> the AIL in the AIL.
> 
> As I keep telling people, what needs to happen is this:
> 
> 	1. buffer write error occurs
> 	2. retries exhausted, mark buffer as errored, run callbacks
> 	3. xfs_iflush_done() (and other iodone callbacks) need to
> 	   look at the buffer error state to determine what to do.
> 	4. if the buffer is errored and recovery is not possible,
> 	   the callback needs to shut down the filesystem.
> 	5. the callback then needs to call flush abort function for
> 	   item (e.g. xfs_iflush_abort()) rather than the normal
> 	   IO completion processing path.
> 
> > 
> >  This is what I did to our 4.6 tree (in testing
> > still, not actually pushed to production)
> > 
> >         if (bp->b_flags & XBF_ASYNC) {
> >                 ASSERT(bp->b_iodone != NULL);
> > 
> >                 trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
> > 
> >                 xfs_buf_ioerror(bp, 0); /* errno of 0 unsets the
> > flag
> > */
> > 
> >                 if (!(bp->b_flags & (XBF_STALE|XBF_WRITE_FAIL))) {
> >                         bp->b_flags |= XBF_WRITE | XBF_ASYNC
> >                                        XBF_DONE | XBF_WRITE_FAIL;
> >                         xfs_buf_submit(bp);
> >                 } else {
> > +                       xfs_buf_do_callbacks(bp);
> > +                       bp->b_fspriv = NULL;
> > +                       bp->b_iodone = NULL;
> >                         xfs_buf_relse(bp);
> >                 }
> > 
> >                 return;
> >         }
> It's likely you'll end up with silent on disk corruption if you do
> that because it's just tossing the error state away.

I think I'm getting a handle on this, but I wonder if we don't have
this problem currently?  We get an error on a buffer, we retry it and
it fails again.  If we don't have XFS_ERR_RETRY_FOREVER set then we
error out and do the normal callbacks as if nothing ever happened, and
we get the silent corruption without shutting down the file system.
How is the code as it stands without XFS_ERR_RETRY_FOREVER safe?
Thanks,

Josef
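
The point Dave makes above about xfs_trans_ail_delete_bulk() can be shown with
a small model: the shutdown reason passed in only matters when an item is *not*
found in the AIL, so deleting a failed item that is still present quietly drops
the error.  The toy AIL and model_* names below are illustrative assumptions,
not the real implementation:

/*
 * Illustrative model only: a toy AIL as an array of items.  The real
 * xfs_trans_ail_delete_bulk() walks the AIL under xa_lock; the point
 * modelled here is just when the shutdown fires.
 */
#include <stdbool.h>
#include <stdio.h>

struct model_item {
	int	id;
	bool	in_ail;
};

static void model_shutdown(const char *reason)
{
	printf("forcing shutdown: %s\n", reason);
}

static void model_ail_delete_bulk(struct model_item **items, int nr,
				  const char *shutdown_reason)
{
	for (int i = 0; i < nr; i++) {
		if (items[i]->in_ail) {
			/* Item found: removed quietly, no shutdown. */
			items[i]->in_ail = false;
		} else {
			/*
			 * Only this path shuts the filesystem down, so a
			 * write error "handled" by deleting an item that is
			 * still in the AIL never triggers it.
			 */
			model_shutdown(shutdown_reason);
		}
	}
}

int main(void)
{
	struct model_item failed = { .id = 1, .in_ail = true };
	struct model_item *list[] = { &failed };

	/* The failed item is still present, so no shutdown happens here. */
	model_ail_delete_bulk(list, 1, "SHUTDOWN_CORRUPT_INCORE");
	printf("shutdown happened: no (error silently dropped)\n");
	return 0;
}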
Brian Foster March 7, 2017, 4:12 p.m. UTC | #10
On Tue, Mar 07, 2017 at 10:56:05AM -0500, Josef Bacik wrote:
> On Tue, 2017-03-07 at 12:22 +1100, Dave Chinner wrote:
> > On Mon, Mar 06, 2017 at 02:00:13PM -0500, Josef Bacik wrote:
> > > 
> > > so you retry once, and the second time through we will return false
> > > if we are unmounting, which means that xfs_buf_do_callbacks gets
> > > called.
> > > 
> > > Now you say that xfs_buf_do_callbacks() being called when we errored
> > > out is dangerous, but I don't understand why.  We have this code
> > > 
> > >         if (need_ail) {
> > >                 struct xfs_log_item *log_items[need_ail];
> > >                 int i = 0;
> > >                 spin_lock(&ailp->xa_lock);
> > >                 for (blip = lip; blip; blip = blip->li_bio_list) {
> > >                         iip = INODE_ITEM(blip);
> > >                         if (iip->ili_logged &&
> > >                             blip->li_lsn == iip->ili_flush_lsn) {
> > >                                 log_items[i++] = blip;
> > >                         }
> > >                         ASSERT(i <= need_ail);
> > >                 }
> > >                 /* xfs_trans_ail_delete_bulk() drops the AIL lock. */
> > >                 xfs_trans_ail_delete_bulk(ailp, log_items, i,
> > >                                           SHUTDOWN_CORRUPT_INCORE);
> > >         }
> > [ Your cut-n-paste is horribly mangled in something non-ascii. ]
> > 
> > > 
> > > and need_ail is set if we didn't actually end up on disk properly.
> > > So we delete the thing and mark the fs as fucked.  Am I mis-reading
> > > what is going on here?
> > Yes. That won't shut down the filesystem on an inode write error. It
> > will simply remove the inode from the AIL, the error will get dropped
> > on the floor, and the filesystem will not be shut down even though
> > it's now in a corrupt state. xfs_trans_ail_delete_bulk() only shuts
> > down the filesystem if it doesn't find the item being removed from
> > the AIL in the AIL.
> > 
> > As I keep telling people, what needs to happen is this:
> > 
> > 	1. buffer write error occurs
> > 	2. retries exhausted, mark buffer as errored, run callbacks
> > 	3. xfs_iflush_done() (and other iodone callbacks) need to
> > 	   look at the buffer error state to determine what to do.
> > 	4. if the buffer is errored and recovery is not possible,
> > 	   the callback needs to shut down the filesystem.
> > 	5. the callback then needs to call flush abort function for
> > 	   item (e.g. xfs_iflush_abort()) rather than the normal
> > 	   IO completion processing path.
> > 
> > > 
> > >  This is what I did to our 4.6 tree (in testing
> > > still, not actually pushed to production)
> > > 
> > >         if (bp->b_flags & XBF_ASYNC) {
> > >                 ASSERT(bp->b_iodone != NULL);
> > > 
> > >                 trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
> > > 
> > >                 xfs_buf_ioerror(bp, 0); /* errno of 0 unsets the flag */
> > > 
> > >                 if (!(bp->b_flags & (XBF_STALE|XBF_WRITE_FAIL))) {
> > >                         bp->b_flags |= XBF_WRITE | XBF_ASYNC |
> > >                                        XBF_DONE | XBF_WRITE_FAIL;
> > >                         xfs_buf_submit(bp);
> > >                 } else {
> > > +                       xfs_buf_do_callbacks(bp);
> > > +                       bp->b_fspriv = NULL;
> > > +                       bp->b_iodone = NULL;
> > >                         xfs_buf_relse(bp);
> > >                 }
> > > 
> > >                 return;
> > >         }
> > It's likely you'll end up with silent on disk corruption if you do
> > that because it's just tossing the error state away.
> 
> I think I'm getting a handle on this, but I wonder if we don't have
> this problem currently?  We get an error on a buffer, we retry it and
> it fails again.  If we don't have XFS_ERR_RETRY_FOREVER set then we
> error out and do the normal callbacks as if nothing ever happened, and
> we get the silent corruption without shutting down the file system.
> How is the code as it stands without XFS_ERR_RETRY_FOREVER safe?
> Thanks,
> 

We shouldn't get into a situation where we run the callbacks when the
I/O has failed and the fs isn't shut down. IOWs, we do run the callbacks
only when the I/O succeeds, or it fails and the fs is shut down as a
result (never when the I/O fails but we haven't shut down, which is the
problematic case the patch above introduces).

That tunable error handling logic (XFS_ERR_RETRY_FOREVER) basically
determines when an I/O error is considered permanent, which means
retries are futile and we have no choice but to shut down the fs. E.g.,
if an error is deemed permanent, we jump to permanent_error:

        /*
         * Permanent error - we need to trigger a shutdown if we haven't already
         * to indicate that inconsistency will result from this action.
         */
permanent_error:
        xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
out_stale:
        xfs_buf_stale(bp);
        bp->b_flags |= XBF_DONE;
        trace_xfs_buf_error_relse(bp, _RET_IP_);
        return false;
}

... which shuts down the filesystem before we go any further.

Brian

> Josef
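
To make the ordering Brian describes concrete, here is a rough model of that
decision: a first failure resubmits the write, and once the error is considered
permanent (retries exhausted under the configured policy, or the filesystem is
unmounting) the shutdown happens before any callbacks can run.  The policy
names and model_* helpers below are simplified assumptions, not the kernel's
actual error configuration code:

/*
 * Simplified model of the "retry or permanent error" decision.  In the
 * real code this lives in the buffer iodone error handling and is
 * driven by per-error-class configuration; here it is reduced to a
 * retry budget and an unmount flag.
 */
#include <stdbool.h>
#include <stdio.h>

enum model_policy {
	MODEL_RETRY_LIMITED,
	MODEL_RETRY_FOREVER,	/* stands in for XFS_ERR_RETRY_FOREVER */
};

struct model_cfg {
	enum model_policy policy;
	int		  max_retries;
};

static bool model_error_is_permanent(const struct model_cfg *cfg,
				     int retries, bool unmounting)
{
	if (unmounting)			/* further retries are futile */
		return true;
	if (cfg->policy == MODEL_RETRY_FOREVER)
		return false;
	return retries >= cfg->max_retries;
}

/* Returns true if the write was resubmitted, false if callbacks may run. */
static bool model_handle_write_error(const struct model_cfg *cfg,
				     int retries, bool unmounting)
{
	if (!model_error_is_permanent(cfg, retries, unmounting)) {
		printf("retry %d: resubmitting write\n", retries + 1);
		return true;
	}

	/*
	 * Permanent error: shut down and stale the buffer first, so the
	 * callbacks that follow never see a failed write on a live fs.
	 */
	printf("permanent error: shutdown, then run callbacks\n");
	return false;
}

int main(void)
{
	struct model_cfg cfg = { .policy = MODEL_RETRY_LIMITED, .max_retries = 1 };

	model_handle_write_error(&cfg, 0, false);	/* first failure */
	model_handle_write_error(&cfg, 1, false);	/* retries exhausted */
	return 0;
}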
Carlos Maiolino April 3, 2017, 9:19 a.m. UTC | #11
Hi,

I really apologize for my late reply; March was a busy month and this thread got
lost in my mailbox.

I'll try to come back to this fix as soon as possible,

> > > > 
> > > >                 return;
> > > >         }
> > > It's likely you'll end up with silent on disk corruption if you do
> > > that because it's just tossing the error state away.
> > 
> > I think I'm getting a handle on this, but I wonder if we don't have
> > this problem currently?  We get an error on a buffer, we retry it and
> > it fails again.  If we don't have XFS_ERR_RETRY_FOREVER set then we
> > error out and do the normal callbacks as if nothing ever happened, and
> > we get the silent corruption without shutting down the file system.
> > How is the code as it stands without XFS_ERR_RETRY_FOREVER safe?
> > Thanks,
> > 
> 
> We shouldn't get into a situation where we run the callbacks when the
> I/O has failed and the fs isn't shut down. IOWs, we do run the callbacks
> only when the I/O succeeds, or it fails and the fs is shut down as a
> result (never when the I/O fails but we haven't shut down, which is the
> problematic case the patch above introduces).
> 
> That tunable error handling logic (XFS_ERR_RETRY_FOREVER) basically
> determines when an I/O error is considered permanent, which means
> retries are futile and we have no choice but to shut down the fs. E.g.,
> if an error is deemed permanent, we jump to permanent_error:
> 
>         /*
>          * Permanent error - we need to trigger a shutdown if we haven't already
>          * to indicate that inconsistency will result from this action.
>          */
> permanent_error:
>         xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> out_stale:
>         xfs_buf_stale(bp);
>         bp->b_flags |= XBF_DONE;
>         trace_xfs_buf_error_relse(bp, _RET_IP_);
>         return false;
> }
> 
> ... which shuts down the filesystem before we go any further.
> 
> Brian
> 
> > Josef

Josef, did you have any success in avoiding this lockup problem with the
fail/retry mechanism mentioned above, or are you still seeing it?
diff mbox

Patch

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 2975cb2..24fcb67 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1016,6 +1016,11 @@  xfs_buf_attach_iodone(
 
 	ASSERT(bp->b_iodone == NULL ||
 	       bp->b_iodone == xfs_buf_iodone_callbacks);
+	/*
+	 * Somebody now cares about the fate of this buffer, clear XBF_ASYNC so
+	 * that the iodone callback actually gets called.
+	 */
+	bp->b_flags &= ~XBF_ASYNC;
 	bp->b_iodone = xfs_buf_iodone_callbacks;
 }