umount XFS hung when stopping the xfsaild kthread
diff mbox

Message ID c869cd34-269b-a13d-5009-390e70aa47ac@huawei.com
State New
Headers show

Commit Message

Hou Tao Sept. 5, 2017, 1:48 p.m. UTC
Hi all,

We recently encountered an XFS umount hang problem. As the following
stacks show, the umount process was trying to stop the xfsaild kthread and
was waiting for it to exit, while the xfsaild thread itself was waiting for
a wake-up.

[<ffffffff810a604a>] kthread_stop+0x4a/0xe0
[<ffffffffa0680317>] xfs_trans_ail_destroy+0x17/0x30 [xfs]
[<ffffffffa067569e>] xfs_log_unmount+0x1e/0x60 [xfs]
[<ffffffffa066ac15>] xfs_unmountfs+0xd5/0x190 [xfs]
[<ffffffffa066da62>] xfs_fs_put_super+0x32/0x90 [xfs]
[<ffffffff811ebad6>] generic_shutdown_super+0x56/0xe0
[<ffffffff811ebf27>] kill_block_super+0x27/0x70
[<ffffffff811ec269>] deactivate_locked_super+0x49/0x60
[<ffffffff811ec866>] deactivate_super+0x46/0x60
[<ffffffff81209995>] mntput_no_expire+0xc5/0x120
[<ffffffff8120aacf>] SyS_umount+0x9f/0x3c0
[<ffffffff81652a09>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

[<ffffffffa067faa7>] xfsaild+0x537/0x5e0 [xfs]
[<ffffffff810a5ddf>] kthread+0xcf/0xe0
[<ffffffff81652958>] ret_from_fork+0x58/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

The kernel version is RHEL7.3 and we are trying to reproduce the problem (no
luck yet). I have checked the related code and suspect the same problem may
also exist in the mainline.

The following is a possible sequence of events which may lead to the hang of umount:

xfsaild: kthread_should_stop() // returns false, so xfsaild continues

umount: set_bit(KTHREAD_SHOULD_STOP, &kthread->flags) // by kthread_stop()

umount: wake_up_process() // xfsaild is still running, so 0 is returned

xfsaild: __set_current_state()
xfsaild: schedule() // now no one will wake it up

The solution I think is to add an extra kthread_should_stop() check before
invoking schedule(). Maybe an smp_mb() is needed too, because we need to
ensure the read of the stop flag happens after the write of the task state.
Something like the following patch:

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 9056c0f..6313f67 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -520,6 +520,11 @@ xfsaild(
                if (!xfs_ail_min(ailp) &&
                    ailp->xa_target == ailp->xa_target_prev) {
                        spin_unlock(&ailp->xa_lock);
+
+                       smp_mb();
+                       if (kthread_should_stop())
+                               break;
+
                        freezable_schedule();
                        tout = 0;
                        continue;

Any suggestions?

Regards,

Tao


--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Brian Foster Sept. 5, 2017, 6:27 p.m. UTC | #1
On Tue, Sep 05, 2017 at 09:48:45PM +0800, Hou Tao wrote:
> Hi all,
> 
> We recently encounter a XFS umount hang problem. As we can see the following
> stacks, the umount process was trying to stop the xfsaild kthread and waiting
> for the exit of the xfsaild thread, and the xfsaild thread was waiting for
> wake-up.
> 
> [<ffffffff810a604a>] kthread_stop+0x4a/0xe0
> [<ffffffffa0680317>] xfs_trans_ail_destroy+0x17/0x30 [xfs]
> [<ffffffffa067569e>] xfs_log_unmount+0x1e/0x60 [xfs]
> [<ffffffffa066ac15>] xfs_unmountfs+0xd5/0x190 [xfs]
> [<ffffffffa066da62>] xfs_fs_put_super+0x32/0x90 [xfs]
> [<ffffffff811ebad6>] generic_shutdown_super+0x56/0xe0
> [<ffffffff811ebf27>] kill_block_super+0x27/0x70
> [<ffffffff811ec269>] deactivate_locked_super+0x49/0x60
> [<ffffffff811ec866>] deactivate_super+0x46/0x60
> [<ffffffff81209995>] mntput_no_expire+0xc5/0x120
> [<ffffffff8120aacf>] SyS_umount+0x9f/0x3c0
> [<ffffffff81652a09>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> [<ffffffffa067faa7>] xfsaild+0x537/0x5e0 [xfs]
> [<ffffffff810a5ddf>] kthread+0xcf/0xe0
> [<ffffffff81652958>] ret_from_fork+0x58/0x90
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> The kernel version is RHEL7.3 and we are trying to reproduce it (not yet).
> I have check the related code and suspect the same problem may also exists in
> the mainline.
> 
> The following is the possible sequences which may lead to the hang of umount:
> 
> xfsaild: kthread_should_stop() // return false, so xfsaild continue
> 
> umount: set_bit(KTHREAD_SHOULD_STOP, &kthread->flags) // by kthread_stop()
> 
> umount: wake_up_process() // because xfsaild is still running, so 0 is returned
> 
> xfsaild: __set_current_state()
> xfsaild: schedule() // Now, on one will wake it up
> 

That seems plausible to me. kthread_stop() sets the SHOULD_STOP bit and
then wakes up the thread. On the other side, xfsaild() checks
SHOULD_STOP, sets the task state and sleeps. It seems like it should be
possible for this to hang if xfsaild() checks the should stop bit and
then kthread_stop() runs before we set the task state (as you've
outlined above).

> The solution I think is adding an extra kthread_should_stop() before
> invoking schedule(). Maybe a smp_mb() is needed too, because we needs to
> ensure the read of the stop flag happens after the write of the task status.
> Something likes the following patch:
> 

I think the important bit is to check after we've set the task state,
yes? That way we know either we've seen the bit or kthread_stop() has
woken the task (which I think means the task remains runnable and will
be rescheduled such that it exits). If so, I'd probably move the check
up after the task state set and add a comment.

> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 9056c0f..6313f67 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -520,6 +520,11 @@ xfsaild(
>                 if (!xfs_ail_min(ailp) &&
>                     ailp->xa_target == ailp->xa_target_prev) {
>                         spin_unlock(&ailp->xa_lock);
> +
> +                       smp_mb();
> +                       if (kthread_should_stop())
> +                               break;
> +
>                         freezable_schedule();
>                         tout = 0;
>                         continue;
> 
> Any suggestions ?
> 

Could you try some hacks to verify this problem on an upstream kernel? I
think you should be able to add an artificial delay in xfsaild() before
we set the task state when the fs is unmounting and the AIL is empty
(ie., it looks like we're going to schedule out), then add a smaller
delay to xfs_trans_ail_destroy() to make sure we wait for xfsaild() to
settle, but run kthread_stop() between the should_stop check and setting
the task state. Then we could potentially confirm the problem and verify
the fix unwinds everything correctly. Hm?

Brian

> Regards,
> 
> Tao
> 
> 
Dave Chinner Sept. 5, 2017, 11 p.m. UTC | #2
On Tue, Sep 05, 2017 at 09:48:45PM +0800, Hou Tao wrote:
> Hi all,
> 
> We recently encounter a XFS umount hang problem. As we can see the following
> stacks, the umount process was trying to stop the xfsaild kthread and waiting
> for the exit of the xfsaild thread, and the xfsaild thread was waiting for
> wake-up.
> 
> [<ffffffff810a604a>] kthread_stop+0x4a/0xe0
> [<ffffffffa0680317>] xfs_trans_ail_destroy+0x17/0x30 [xfs]
> [<ffffffffa067569e>] xfs_log_unmount+0x1e/0x60 [xfs]
> [<ffffffffa066ac15>] xfs_unmountfs+0xd5/0x190 [xfs]
> [<ffffffffa066da62>] xfs_fs_put_super+0x32/0x90 [xfs]
> [<ffffffff811ebad6>] generic_shutdown_super+0x56/0xe0
> [<ffffffff811ebf27>] kill_block_super+0x27/0x70
> [<ffffffff811ec269>] deactivate_locked_super+0x49/0x60
> [<ffffffff811ec866>] deactivate_super+0x46/0x60
> [<ffffffff81209995>] mntput_no_expire+0xc5/0x120
> [<ffffffff8120aacf>] SyS_umount+0x9f/0x3c0
> [<ffffffff81652a09>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> [<ffffffffa067faa7>] xfsaild+0x537/0x5e0 [xfs]
> [<ffffffff810a5ddf>] kthread+0xcf/0xe0
> [<ffffffff81652958>] ret_from_fork+0x58/0x90
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> The kernel version is RHEL7.3 and we are trying to reproduce it (not yet).
> I have check the related code and suspect the same problem may also exists in
> the mainline.
> 
> The following is the possible sequences which may lead to the hang of umount:
> 
> xfsaild: kthread_should_stop() // return false, so xfsaild continue
> 
> umount: set_bit(KTHREAD_SHOULD_STOP, &kthread->flags) // by kthread_stop()
> 
> umount: wake_up_process() // because xfsaild is still running, so 0 is returned
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This, to me, is where the problem lies. By the time unmount is
asking the aild to stop, the xfsaild should already be idle and
scheduled because unmount has just completed a synchronous push of
the AIL, i.e. xfs_ail_push_all_sync() waits for the AIL to empty,
which should result in the aild returning to the idle state and
sleeping in freezable_schedule().

Work out why the aild is still running after the log has supposedly
been emptied and unmount records have been written first, then look
for a solution. Also, as Brian suggested, reproducing on an upstream
kernel is a good idea, because it's entirely possible this is a
vendor kernel (i.e.  RHEL) specific bug....

> xfsaild: __set_current_state()
> xfsaild: schedule() // Now, on one will wake it up
> 
> The solution I think is adding an extra kthread_should_stop() before
> invoking schedule(). Maybe a smp_mb() is needed too, because we needs to
> ensure the read of the stop flag happens after the write of the task status.
> Something likes the following patch:
> 
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 9056c0f..6313f67 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -520,6 +520,11 @@ xfsaild(
>                 if (!xfs_ail_min(ailp) &&
>                     ailp->xa_target == ailp->xa_target_prev) {
>                         spin_unlock(&ailp->xa_lock);
> +
> +                       smp_mb();
> +                       if (kthread_should_stop())
> +                               break;
> +
>                         freezable_schedule();
>                         tout = 0;
>                         continue;

This is still racy.  i.e. What happens if the stop bit is set
between the new kthread_should_stop() check and
freezable_schedule()? It'll still hang, right?

Also, breaking out of the loop there leaves the task in
TASK_INTERRUPTIBLE/TASK_KILLABLE state - it needs to leave this
function in TASK_RUNNING state.

FWIW, it is my understanding that the sort of schedule vs ttwu race
you are implying exists here is avoided by the task state being set
to TASK_INTERRUPTIBLE/TASK_KILLABLE and schedule only parking the
task if the task was in this state. i.e. if ttwu is called before
schedule, then the task state will have been modified to either
TASK_WAKING or TASK_RUNNING before schedule is called and so
schedule() is then effectively a no-op.  In that case, we go around
the loop again, hit the kthread_should_stop() check, and we stop.
Hence, if I've remembered this all correctly, I don't think adding
this extra kthread_should_stop() check will make any difference,
either.

Cheers,

Dave.
Hou Tao Sept. 6, 2017, 10:59 a.m. UTC | #3
Hi Brian,

On 2017/9/6 2:27, Brian Foster wrote:
> On Tue, Sep 05, 2017 at 09:48:45PM +0800, Hou Tao wrote:
>> Hi all,
>>
>> We recently encounter a XFS umount hang problem. As we can see the following
>> stacks, the umount process was trying to stop the xfsaild kthread and waiting
>> for the exit of the xfsaild thread, and the xfsaild thread was waiting for
>> wake-up.
>>
>> [<ffffffff810a604a>] kthread_stop+0x4a/0xe0
>> [<ffffffffa0680317>] xfs_trans_ail_destroy+0x17/0x30 [xfs]
>> [<ffffffffa067569e>] xfs_log_unmount+0x1e/0x60 [xfs]
>> [<ffffffffa066ac15>] xfs_unmountfs+0xd5/0x190 [xfs]
>> [<ffffffffa066da62>] xfs_fs_put_super+0x32/0x90 [xfs]
>> [<ffffffff811ebad6>] generic_shutdown_super+0x56/0xe0
>> [<ffffffff811ebf27>] kill_block_super+0x27/0x70
>> [<ffffffff811ec269>] deactivate_locked_super+0x49/0x60
>> [<ffffffff811ec866>] deactivate_super+0x46/0x60
>> [<ffffffff81209995>] mntput_no_expire+0xc5/0x120
>> [<ffffffff8120aacf>] SyS_umount+0x9f/0x3c0
>> [<ffffffff81652a09>] system_call_fastpath+0x16/0x1b
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> [<ffffffffa067faa7>] xfsaild+0x537/0x5e0 [xfs]
>> [<ffffffff810a5ddf>] kthread+0xcf/0xe0
>> [<ffffffff81652958>] ret_from_fork+0x58/0x90
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> The kernel version is RHEL7.3 and we are trying to reproduce it (not yet).
>> I have check the related code and suspect the same problem may also exists in
>> the mainline.
>>
>> The following is the possible sequences which may lead to the hang of umount:
>>
>> xfsaild: kthread_should_stop() // return false, so xfsaild continue
>>
>> umount: set_bit(KTHREAD_SHOULD_STOP, &kthread->flags) // by kthread_stop()
>>
>> umount: wake_up_process() // because xfsaild is still running, so 0 is returned
>>
>> xfsaild: __set_current_state()
>> xfsaild: schedule() // Now, on one will wake it up
>>
> 
> That seems plausible to me. kthread_stop() sets the SHOULD_STOP bit and
> then wakes up the thread. On the other side, xfsaild() checks
> SHOULD_STOP, sets the task state and sleeps. It seems like it should be
> possible for this to hang if xfsaild() checks the should stop bit and
> then kthread_stop() runs before we set the task state (as you've
> outlined above).
> 
>> The solution I think is adding an extra kthread_should_stop() before
>> invoking schedule(). Maybe a smp_mb() is needed too, because we needs to
>> ensure the read of the stop flag happens after the write of the task status.
>> Something likes the following patch:
>>
> 
> I think the important bit is to check after we've set the task state,
> yes? That way we know either we've seen the bit or kthread_stop() has
> woken the task (which I think means the task remains runnable and will
> be rescheduled such that it exits). If so, I'd probably move the check
> up after the task state set and add a comment.
Yes, that's what I tried to express (but failed to).
Moving the check up after the task state set is much clearer.

>> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
>> index 9056c0f..6313f67 100644
>> --- a/fs/xfs/xfs_trans_ail.c
>> +++ b/fs/xfs/xfs_trans_ail.c
>> @@ -520,6 +520,11 @@ xfsaild(
>>                 if (!xfs_ail_min(ailp) &&
>>                     ailp->xa_target == ailp->xa_target_prev) {
>>                         spin_unlock(&ailp->xa_lock);
>> +
>> +                       smp_mb();
>> +                       if (kthread_should_stop())
>> +                               break;
>> +
>>                         freezable_schedule();
>>                         tout = 0;
>>                         continue;
>>
>> Any suggestions ?
>>
> 
> Could you try some hacks to verify this problem on an upstream kernel? I
> think you should be able to add an artificial delay in xfsaild() before
> we set the task state when the fs is unmounting and the AIL is empty
> (ie., it looks like we're going to schedule out), then add a smaller
> delay to xfs_trans_ail_destroy() to make sure we wait for xfsaild() to
> settle, but run kthread_stop() between the should_stop check and setting
> the task state. Then we could potentially confirm the problem and verify
> the fix unwinds everything correctly. Hm?
Thanks for your suggestion. Following it, we have reproduced the umount hang
problem on both the RHEL7 kernel and an upstream kernel.

We added a 50us delay before the kthread_should_stop() and __set_current_state()
calls in xfsaild(), and a 20us delay before kthread_stop() in xfs_trans_ail_destroy().
With these hacks in place, we ran an infinite loop of mount + do nothing + umount,
and the umount hang occurred within ten minutes or less.

Regards,

Tao

> 
> Brian
> 
>> Regards,
>>
>> Tao
>>
>>
Brian Foster Sept. 6, 2017, 11:11 a.m. UTC | #4
On Wed, Sep 06, 2017 at 09:00:43AM +1000, Dave Chinner wrote:
> On Tue, Sep 05, 2017 at 09:48:45PM +0800, Hou Tao wrote:
> > Hi all,
> > 
> > We recently encounter a XFS umount hang problem. As we can see the following
> > stacks, the umount process was trying to stop the xfsaild kthread and waiting
> > for the exit of the xfsaild thread, and the xfsaild thread was waiting for
> > wake-up.
> > 
> > [<ffffffff810a604a>] kthread_stop+0x4a/0xe0
> > [<ffffffffa0680317>] xfs_trans_ail_destroy+0x17/0x30 [xfs]
> > [<ffffffffa067569e>] xfs_log_unmount+0x1e/0x60 [xfs]
> > [<ffffffffa066ac15>] xfs_unmountfs+0xd5/0x190 [xfs]
> > [<ffffffffa066da62>] xfs_fs_put_super+0x32/0x90 [xfs]
> > [<ffffffff811ebad6>] generic_shutdown_super+0x56/0xe0
> > [<ffffffff811ebf27>] kill_block_super+0x27/0x70
> > [<ffffffff811ec269>] deactivate_locked_super+0x49/0x60
> > [<ffffffff811ec866>] deactivate_super+0x46/0x60
> > [<ffffffff81209995>] mntput_no_expire+0xc5/0x120
> > [<ffffffff8120aacf>] SyS_umount+0x9f/0x3c0
> > [<ffffffff81652a09>] system_call_fastpath+0x16/0x1b
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > [<ffffffffa067faa7>] xfsaild+0x537/0x5e0 [xfs]
> > [<ffffffff810a5ddf>] kthread+0xcf/0xe0
> > [<ffffffff81652958>] ret_from_fork+0x58/0x90
> > [<ffffffffffffffff>] 0xffffffffffffffff
> > 
> > The kernel version is RHEL7.3 and we are trying to reproduce it (not yet).
> > I have check the related code and suspect the same problem may also exists in
> > the mainline.
> > 
> > The following is the possible sequences which may lead to the hang of umount:
> > 
> > xfsaild: kthread_should_stop() // return false, so xfsaild continue
> > 
> > umount: set_bit(KTHREAD_SHOULD_STOP, &kthread->flags) // by kthread_stop()
> > 
> > umount: wake_up_process() // because xfsaild is still running, so 0 is returned
>                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> This, to me, is where the problem lies. By the time unmount is
> asking the aild to stop, the xfsaild should already be idle and
> scheduled because unmount has just completed a syncrhonous push of
> the AIL. i.e.  xfs_ail_push_all_sync()) waits for the AIL to empty
> which should result in the aild returning to the idle state and
> sleeping in freezable_schedule().
> 

I think this behavior is to be expected. The xfsaild() logic schedules
itself out without a timeout when the AIL is empty, but the task may not
see the AIL as empty immediately because the empty state doesn't occur
until I/O completion of the associated buffers removes all of the log
items from the AIL. Therefore, I think it's probably normal for xfsaild
to spin around a bit until that empty state occurs. This is probably
what allows kthread_stop() to race at unmount time.

> Work out why the aild is still running after the log has supposedly
> been emptied and unmount records have been written first, then look
> for a solution. Also, as Brian suggested, reproducing on an upstream
> kernel is a good idea, because it's entirely possible this is a
> vendor kernel (i.e.  RHEL) specific bug....
> 

FWIW, I ran a quick test on for-next since there hasn't been a reply to
this thread in that regard. Add a 10s delay between
kthread_should_stop() and __set_current_state() in xfsaild (when
unmounting and AIL is empty) and a 5s delay before kthread_stop() in
xfs_trans_ail_destroy() and the problem reproduces consistently.

Checking kthread_should_stop() after we set the task state addresses the
problem. This is because of the order of operations between
kthread_stop() and xfsaild(). The former sets the stop bit and wakes the
task. If the latter sets the task state and then checks the stop bit (as
opposed to doing the opposite as it does currently), it will either see
the stop bit and exit or the task state is reset to runnable such that
it isn't blocked indefinitely (and the next iteration detects the stop
bit).

Hou,

Care to update your patch with this information and the previous
suggestions from Dave and me (pull up the check, add a comment, and make
sure to reset the task state)?

Brian

> > xfsaild: __set_current_state()
> > xfsaild: schedule() // Now, on one will wake it up
> > 
> > The solution I think is adding an extra kthread_should_stop() before
> > invoking schedule(). Maybe a smp_mb() is needed too, because we needs to
> > ensure the read of the stop flag happens after the write of the task status.
> > Something likes the following patch:
> > 
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index 9056c0f..6313f67 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -520,6 +520,11 @@ xfsaild(
> >                 if (!xfs_ail_min(ailp) &&
> >                     ailp->xa_target == ailp->xa_target_prev) {
> >                         spin_unlock(&ailp->xa_lock);
> > +
> > +                       smp_mb();
> > +                       if (kthread_should_stop())
> > +                               break;
> > +
> >                         freezable_schedule();
> >                         tout = 0;
> >                         continue;
> 
> This is still racy.  i.e. What happens if the stop bit is set
> between the new kthread_should_stop() check and
> freezable_schedule()? It'll still hang, right?
> 
> Also, breaking out of the loop there leaves the task in
> TASK_INTERRUPTIBLE/TASK_KILLABLE state - it needs to leave this
> function in TASK_RUNNING state.
> 
> FWIW, it is my understanding that the sort of schedule vs ttwu race
> you are implying exists here are avoided by the task state being set
> to TASK_INTERRUPTIBLE/TASK_KILLABLE and schedule only parking the
> task if the task was in this state. i.e. if ttwu is called before
> schedule, then the task state will have been modified to either
> TASK_WAKING or TASK_RUNNING before schedule is called and so
> schedule() is then effectively a no-op.  In that case, we go around
> the loop again, hit the kthread_should_stop() check, and we stop.
> Hence, if I've remembered this all correctly, I don't think adding
> this extra kthread_should_stop() check will make any difference,
> either.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Brian Foster Sept. 6, 2017, 11:13 a.m. UTC | #5
On Wed, Sep 06, 2017 at 06:59:26PM +0800, Hou Tao wrote:
> Hi Brian,
> 
> On 2017/9/6 2:27, Brian Foster wrote:
> > On Tue, Sep 05, 2017 at 09:48:45PM +0800, Hou Tao wrote:
> >> Hi all,
> >>
> >> We recently encounter a XFS umount hang problem. As we can see the following
> >> stacks, the umount process was trying to stop the xfsaild kthread and waiting
> >> for the exit of the xfsaild thread, and the xfsaild thread was waiting for
> >> wake-up.
> >>
> >> [<ffffffff810a604a>] kthread_stop+0x4a/0xe0
> >> [<ffffffffa0680317>] xfs_trans_ail_destroy+0x17/0x30 [xfs]
> >> [<ffffffffa067569e>] xfs_log_unmount+0x1e/0x60 [xfs]
> >> [<ffffffffa066ac15>] xfs_unmountfs+0xd5/0x190 [xfs]
> >> [<ffffffffa066da62>] xfs_fs_put_super+0x32/0x90 [xfs]
> >> [<ffffffff811ebad6>] generic_shutdown_super+0x56/0xe0
> >> [<ffffffff811ebf27>] kill_block_super+0x27/0x70
> >> [<ffffffff811ec269>] deactivate_locked_super+0x49/0x60
> >> [<ffffffff811ec866>] deactivate_super+0x46/0x60
> >> [<ffffffff81209995>] mntput_no_expire+0xc5/0x120
> >> [<ffffffff8120aacf>] SyS_umount+0x9f/0x3c0
> >> [<ffffffff81652a09>] system_call_fastpath+0x16/0x1b
> >> [<ffffffffffffffff>] 0xffffffffffffffff
> >>
> >> [<ffffffffa067faa7>] xfsaild+0x537/0x5e0 [xfs]
> >> [<ffffffff810a5ddf>] kthread+0xcf/0xe0
> >> [<ffffffff81652958>] ret_from_fork+0x58/0x90
> >> [<ffffffffffffffff>] 0xffffffffffffffff
> >>
> >> The kernel version is RHEL7.3 and we are trying to reproduce it (not yet).
> >> I have check the related code and suspect the same problem may also exists in
> >> the mainline.
> >>
> >> The following is the possible sequences which may lead to the hang of umount:
> >>
> >> xfsaild: kthread_should_stop() // return false, so xfsaild continue
> >>
> >> umount: set_bit(KTHREAD_SHOULD_STOP, &kthread->flags) // by kthread_stop()
> >>
> >> umount: wake_up_process() // because xfsaild is still running, so 0 is returned
> >>
> >> xfsaild: __set_current_state()
> >> xfsaild: schedule() // Now, on one will wake it up
> >>
> > 
> > That seems plausible to me. kthread_stop() sets the SHOULD_STOP bit and
> > then wakes up the thread. On the other side, xfsaild() checks
> > SHOULD_STOP, sets the task state and sleeps. It seems like it should be
> > possible for this to hang if xfsaild() checks the should stop bit and
> > then kthread_stop() runs before we set the task state (as you've
> > outlined above).
> > 
> >> The solution I think is adding an extra kthread_should_stop() before
> >> invoking schedule(). Maybe a smp_mb() is needed too, because we needs to
> >> ensure the read of the stop flag happens after the write of the task status.
> >> Something likes the following patch:
> >>
> > 
> > I think the important bit is to check after we've set the task state,
> > yes? That way we know either we've seen the bit or kthread_stop() has
> > woken the task (which I think means the task remains runnable and will
> > be rescheduled such that it exits). If so, I'd probably move the check
> > up after the task state set and add a comment.
> Yes, that's what I tries to express (but failed to).
> Yes, moving the check up after the task state set is much clearer.
> 
> >> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> >> index 9056c0f..6313f67 100644
> >> --- a/fs/xfs/xfs_trans_ail.c
> >> +++ b/fs/xfs/xfs_trans_ail.c
> >> @@ -520,6 +520,11 @@ xfsaild(
> >>                 if (!xfs_ail_min(ailp) &&
> >>                     ailp->xa_target == ailp->xa_target_prev) {
> >>                         spin_unlock(&ailp->xa_lock);
> >> +
> >> +                       smp_mb();
> >> +                       if (kthread_should_stop())
> >> +                               break;
> >> +
> >>                         freezable_schedule();
> >>                         tout = 0;
> >>                         continue;
> >>
> >> Any suggestions ?
> >>
> > 
> > Could you try some hacks to verify this problem on an upstream kernel? I
> > think you should be able to add an artificial delay in xfsaild() before
> > we set the task state when the fs is unmounting and the AIL is empty
> > (ie., it looks like we're going to schedule out), then add a smaller
> > delay to xfs_trans_ail_destroy() to make sure we wait for xfsaild() to
> > settle, but run kthread_stop() between the should_stop check and setting
> > the task state. Then we could potentially confirm the problem and verify
> > the fix unwinds everything correctly. Hm?
> Thanks for your suggestion. As per your suggestion, we had reproduced the
> umount hang problem on both RHEL7 kernel and upstream kernel.
> 
> We just add a 50us delay before the kthread_should_stop() and __set_current_state()
> in xfsaild(), and add a 20us delay before kthread_stop() in xfs_trans_ail_destroy().
> After the hacks, we run an infinite loop: mount + do nothing + umount, and then the
> hang of umount will occur after ten minutes or less.
> 

Ok, thanks for testing. I just sent another mail that shows a larger
delay will reproduce the same problem in one test cycle (mount, touch a
file, umount).

Brian

> Regards,
> 
> Tao
> 
> > 
> > Brian
> > 
> >> Regards,
> >>
> >> Tao
> >>
> >>
Dave Chinner Sept. 6, 2017, 11:47 a.m. UTC | #6
On Wed, Sep 06, 2017 at 07:11:45AM -0400, Brian Foster wrote:
> On Wed, Sep 06, 2017 at 09:00:43AM +1000, Dave Chinner wrote:
> > On Tue, Sep 05, 2017 at 09:48:45PM +0800, Hou Tao wrote:
> > > Hi all,
> > > 
> > > We recently encounter a XFS umount hang problem. As we can see the following
> > > stacks, the umount process was trying to stop the xfsaild kthread and waiting
> > > for the exit of the xfsaild thread, and the xfsaild thread was waiting for
> > > wake-up.
> > > 
> > > [<ffffffff810a604a>] kthread_stop+0x4a/0xe0
> > > [<ffffffffa0680317>] xfs_trans_ail_destroy+0x17/0x30 [xfs]
> > > [<ffffffffa067569e>] xfs_log_unmount+0x1e/0x60 [xfs]
> > > [<ffffffffa066ac15>] xfs_unmountfs+0xd5/0x190 [xfs]
> > > [<ffffffffa066da62>] xfs_fs_put_super+0x32/0x90 [xfs]
> > > [<ffffffff811ebad6>] generic_shutdown_super+0x56/0xe0
> > > [<ffffffff811ebf27>] kill_block_super+0x27/0x70
> > > [<ffffffff811ec269>] deactivate_locked_super+0x49/0x60
> > > [<ffffffff811ec866>] deactivate_super+0x46/0x60
> > > [<ffffffff81209995>] mntput_no_expire+0xc5/0x120
> > > [<ffffffff8120aacf>] SyS_umount+0x9f/0x3c0
> > > [<ffffffff81652a09>] system_call_fastpath+0x16/0x1b
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> > > 
> > > [<ffffffffa067faa7>] xfsaild+0x537/0x5e0 [xfs]
> > > [<ffffffff810a5ddf>] kthread+0xcf/0xe0
> > > [<ffffffff81652958>] ret_from_fork+0x58/0x90
> > > [<ffffffffffffffff>] 0xffffffffffffffff
> > > 
> > > The kernel version is RHEL7.3 and we are trying to reproduce it (not yet).
> > > I have check the related code and suspect the same problem may also exists in
> > > the mainline.
> > > 
> > > The following is the possible sequences which may lead to the hang of umount:
> > > 
> > > xfsaild: kthread_should_stop() // return false, so xfsaild continue
> > > 
> > > umount: set_bit(KTHREAD_SHOULD_STOP, &kthread->flags) // by kthread_stop()
> > > 
> > > umount: wake_up_process() // because xfsaild is still running, so 0 is returned
> >                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > This, to me, is where the problem lies. By the time unmount is
> > asking the aild to stop, the xfsaild should already be idle and
> > scheduled because unmount has just completed a syncrhonous push of
> > the AIL. i.e.  xfs_ail_push_all_sync()) waits for the AIL to empty
> > which should result in the aild returning to the idle state and
> > sleeping in freezable_schedule().
> > 
> 
> I think this behavior is to be expected. The xfsaild() logic schedules
> itself out without a timeout when the AIL is empty, but the task may not
> see the AIL as empty immediately because the empty state doesn't occur
> until I/O completion of the associated buffers removes all of the log
> items from the AIL.

Sure, but xfs_ail_push_all_sync() doesn't return until the AIL is
empty:

        spin_lock(&ailp->xa_lock);
        while ((lip = xfs_ail_max(ailp)) != NULL) {
                prepare_to_wait(&ailp->xa_empty, &wait, TASK_UNINTERRUPTIBLE);
                ailp->xa_target = lip->li_lsn;
                wake_up_process(ailp->xa_task);
                spin_unlock(&ailp->xa_lock);
                schedule();
                spin_lock(&ailp->xa_lock);
        }
        spin_unlock(&ailp->xa_lock);

And so the xfsaild should also be entering the empty, idle state on
its next pass. Given that we then run a buftarg wait, cycle
superblock buffer locks and write an unmount record before we
tear down the AIL, I'm kinda surprised that the AIL hasn't actually
entered the full idle state here.

> > Work out why the aild is still running after the log has supposedly
> > been emptied and unmount records have been written first, then look
> > for a solution. Also, as Brian suggested, reproducing on an upstream
> > kernel is a good idea, because it's entirely possible this is a
> > vendor kernel (i.e.  RHEL) specific bug....
> > 
> 
> FWIW, I ran a quick test on for-next since there hasn't been a reply to
> this thread in that regard. Adding a 10s delay between
> kthread_should_stop() and __set_current_state() in xfsaild() (when
> unmounting and the AIL is empty) and a 5s delay before kthread_stop() in
> xfs_trans_ail_destroy() reproduces the problem consistently.

Right, you've forced the wakeup to occur directly in the place that
memory-barriers.txt says it will be ignored by putting exceedingly
long wait times into the loop....

> Checking kthread_should_stop() after we set the task state addresses the
> problem. This is because of the order of operations between
> kthread_stop() and xfsaild().

Yup, and that's something we've never actually cared about inside
the critical schedule loop because it's the AIL state operation that
matters for normal operation. i.e. if we miss a tail push wakeup we
could deadlock the log, and that's a much more noticeable problem.

Miss a wakeup on unmount? Never occurred until now...  Whoever
thought that stopping a thread could be so damn complex? :/

> The former sets the stop bit and wakes the
> task. If the latter sets the task state and then checks the stop bit (as
> opposed to doing the opposite as it does currently), it will either see
> the stop bit and exit or the task state is reset to runnable such that
> it isn't blocked indefinitely (and the next iteration detects the stop
> bit).
> 
> Hou,
> 
> Care to update your patch with this information and the previous
> suggestions from Dave and me (pull up the check, add a comment, and
> make sure to reset the task state)?

The loop really needs to be completely restructured - we should only
need to check kthread_should_stop() once per loop cycle...

Cheers,

Dave.
Brian Foster Sept. 6, 2017, 12:18 p.m. UTC | #7
On Wed, Sep 06, 2017 at 09:47:42PM +1000, Dave Chinner wrote:
> On Wed, Sep 06, 2017 at 07:11:45AM -0400, Brian Foster wrote:
> > On Wed, Sep 06, 2017 at 09:00:43AM +1000, Dave Chinner wrote:
> > > On Tue, Sep 05, 2017 at 09:48:45PM +0800, Hou Tao wrote:
> > > > [...]
> > > > The following is the possible sequences which may lead to the hang of umount:
> > > > 
> > > > xfsaild: kthread_should_stop() // returns false, so xfsaild continues
> > > > 
> > > > umount: set_bit(KTHREAD_SHOULD_STOP, &kthread->flags) // by kthread_stop()
> > > > 
> > > > umount: wake_up_process() // because xfsaild is still running, so 0 is returned
> > >                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > > 
> > > This, to me, is where the problem lies. By the time unmount is
> > > asking the aild to stop, the xfsaild should already be idle and
> > > scheduled because unmount has just completed a synchronous push of
> > > the AIL, i.e. xfs_ail_push_all_sync() waits for the AIL to empty,
> > > which should result in the aild returning to the idle state and
> > > sleeping in freezable_schedule().
> > > 
> > 
> > I think this behavior is to be expected. The xfsaild() logic schedules
> > itself out without a timeout when the AIL is empty, but the task may not
> > see the AIL as empty immediately because the empty state doesn't occur
> > until I/O completion of the associated buffers removes all of the log
> > items from the AIL.
> 
> Sure, but xfs_ail_push_all_sync() doesn't return until the AIL is
> empty:
> 
>         spin_lock(&ailp->xa_lock);
>         while ((lip = xfs_ail_max(ailp)) != NULL) {
>                 prepare_to_wait(&ailp->xa_empty, &wait, TASK_UNINTERRUPTIBLE);
>                 ailp->xa_target = lip->li_lsn;
>                 wake_up_process(ailp->xa_task);
>                 spin_unlock(&ailp->xa_lock);
>                 schedule();
>                 spin_lock(&ailp->xa_lock);
>         }
>         spin_unlock(&ailp->xa_lock);
> 
> And so the xfsaild should also be entering the empty, idle state on
> its next pass. Given that we then run a buftarg wait, cycle
> superblock buffer locks and write an unmount record before we
> tear down the AIL, I'm kinda surprised that the AIL hasn't actually
> entered the full idle state here.
> 

Perhaps, but there's mutual exclusion and a wakeup event involved here
as well. If this task is woken and schedules first while xfsaild
happens to be just entering a max-timeout sleep, that could easily be
enough time for the unmount thread to get far enough along for the
stop to race. It may be interesting to see whether bumping the "empty"
timeout makes it easier to reproduce, but even without the timeout I
don't think there's any guarantee xfsaild ends up scheduled between
its last sleep and the kthread_stop(). It seems to me we're simply
subject to nondeterministic scheduler conditions at that point (or
maybe an interrupt or something occurs, etc.). This seems to be quite
rare, after all.

...
> > Hou,
> > 
> > Care to update your patch with this information and the previous
> > suggestions from Dave and me (pull up the check, add a comment, and
> > make sure to reset the task state)?
> 
> The loop really needs to be completely restructured - we should only
> need to check kthread_should_stop() once per loop cycle...
> 

Ok, it seems reasonable to me to clean up the whole thing a bit.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Brian Foster Sept. 7, 2017, 12:24 p.m. UTC | #8
cc linux-xfs

On Thu, Sep 07, 2017 at 12:41:07AM +0200, Luis R. Rodriguez wrote:
> On Wed, Sep 06, 2017 at 07:11:45AM -0400, Brian Foster wrote:
> > FWIW, I ran a quick test on for-next since there hasn't been a reply to
> > this thread in that regard. Add a 10s delay between
> > kthread_should_stop() and __set_current_state() in xfsaild (when
> > unmounting and AIL is empty) and a 5s delay before kthread_stop() in
> > xfs_trans_ail_destroy() and the problem reproduces consistently.
> 
> Do you mean as follows?
> 

Essentially, though I also included empty-AIL (!xfs_ail_min) and
unmounting (mp->m_flags & XFS_MOUNT_UNMOUNTING) checks to limit the
xfsaild delay to the case we care about.

Brian

> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 354368a906e5..475134833a4d 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -515,6 +515,7 @@ xfsaild(
>  	set_freezable();
>  
>  	while (!kthread_should_stop()) {
> +		ssleep(10);
>  		if (tout && tout <= 20)
>  			__set_current_state(TASK_KILLABLE);
>  		else
> @@ -805,6 +806,7 @@ xfs_trans_ail_destroy(
>  {
>  	struct xfs_ail	*ailp = mp->m_ail;
>  
> +	ssleep(5);
>  	kthread_stop(ailp->xa_task);
>  	kmem_free(ailp);
>  }

Patch
diff mbox

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 9056c0f..6313f67 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -520,6 +520,11 @@  xfsaild(
                if (!xfs_ail_min(ailp) &&
                    ailp->xa_target == ailp->xa_target_prev) {
                        spin_unlock(&ailp->xa_lock);
+
+                       smp_mb();
+                       if (kthread_should_stop())
+                               break;
+
                        freezable_schedule();
                        tout = 0;
                        continue;