
Preserve task state in reentrant calls to ___wait_event

Message ID 20151106204408.GA11609@localhost (mailing list archive)
State Not Applicable, archived

Commit Message

Chris Bainbridge Nov. 6, 2015, 8:44 p.m. UTC
In the ACPI SBS initialisation, a reentrant call to wait_event_timeout()
causes an intermittent boot stall of several minutes usually following
the "Switching to clocksource tsc" message. This stall is caused by:

 1. drivers/acpi/sbshc.c wait_transaction_complete() calls
    wait_event_timeout():

 	if (wait_event_timeout(hc->wait, smb_check_done(hc),
 			       msecs_to_jiffies(timeout)))

 2. ___wait_event sets task state to uninterruptible

 3. ___wait_event calls the "condition" smb_check_done()

 4. smb_check_done (sbshc.c) calls through to ec_read() in
    drivers/acpi/ec.c

 5. ec_guard() is reached, which calls wait_event_timeout():

 	if (wait_event_timeout(ec->wait,
 			       ec_transaction_completed(ec),
 			       guard))

    i.e. wait_event_timeout() is called again inside the evaluation of
    the previous wait_event_timeout() condition

 6. The EC IRQ handler calls wake_up() and wakes up the sleeping task in
    ec_guard()

 7. The task is now in the running state even though the wait "condition"
    is still being evaluated

 8. The "condition" check returns false, so ___wait_event calls
    schedule_timeout()

 9. Since the task state is running, the scheduler immediately schedules
    it again

10. This loop repeats for around 250 seconds even though the original
    wait_event_timeout() timeout was only 1000 ms.

    This happens because each call to schedule_timeout() usually returns
    immediately, taking less than 1 ms, so the jiffies timeout counter is
    not decremented. The task is stuck in the running state and so is
    highly likely to be rescheduled immediately, which takes less than a
    jiffy.

The root problem is that wait_event_timeout() does not preserve the task
state when it is called by a task whose state is not TASK_RUNNING. We fix
this by preserving and restoring the task state in ___wait_event().
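
For illustration, the shape of the problem is roughly the following (a
hypothetical, simplified sketch, not the literal driver code; the real
paths are wait_transaction_complete() in sbshc.c and ec_guard() in ec.c):

	/* Hypothetical sketch only; struct/function names are made up. */
	static bool transaction_done(struct sbs_device *dev)
	{
		/*
		 * Reading the status goes through the EC, and the EC read
		 * path sleeps on its own wait queue:
		 *
		 *	wait_event_timeout(ec->wait,
		 *			   ec_transaction_completed(ec),
		 *			   guard);
		 *
		 * The EC IRQ handler's wake_up() leaves current in
		 * TASK_RUNNING, clobbering the TASK_UNINTERRUPTIBLE state
		 * that the outer ___wait_event() set just before evaluating
		 * this condition.
		 */
		return read_status_via_ec(dev) == STATUS_DONE;
	}

	static int wait_for_transaction(struct sbs_device *dev)
	{
		/*
		 * Outer wait: because its condition itself sleeps and is
		 * woken, the subsequent schedule_timeout() runs with the
		 * task in TASK_RUNNING and returns almost immediately, so
		 * the 1000 ms timeout barely ticks down.
		 */
		return wait_event_timeout(dev->wait, transaction_done(dev),
					  msecs_to_jiffies(1000)) ? 0 : -ETIME;
	}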

Signed-off-by: Chris Bainbridge <chris.bainbridge@gmail.com>
---
I am assuming here that wait_event_timeout() is supposed to support reentrant
calls. If not, perhaps it should BUG_ON when called with a non-running task
state, and the SBS HC / ACPI EC code needs to be fixed to stop doing this. If
reentrant calls are supposed to work, then this patch will preserve the task
state (there may be a more appropriate way to support reentrant calls than this
exact patch, suggestions/alternatives are welcome, but this does work).
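
For reference, if the driver-side route is preferred instead, one
(untested, purely illustrative) alternative would be to keep the sleeping
EC read out of the wait condition and poll, along these lines:

	/*
	 * Untested, illustrative sketch only: poll the EC-backed status from
	 * the loop body (where the task is TASK_RUNNING) instead of sleeping
	 * inside a wait_event_timeout() condition.
	 */
	static int wait_transaction_complete(struct acpi_smb_hc *hc, int timeout)
	{
		unsigned long end = jiffies + msecs_to_jiffies(timeout);

		do {
			if (smb_check_done(hc))	/* may sleep in ec_guard(); fine here */
				return 0;
			usleep_range(500, 1000);
		} while (time_before(jiffies, end));

		return -ETIME;
	}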
---
 include/linux/wait.h | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Comments

Rafael J. Wysocki Nov. 6, 2015, 11:06 p.m. UTC | #1
On Friday, November 06, 2015 08:44:08 PM Chris Bainbridge wrote:
> In the ACPI SBS initialisation, a reentrant call to wait_event_timeout()
> causes an intermittent boot stall of several minutes usually following
> the "Switching to clocksource tsc" message. This stall is caused by:
> 
>  1. drivers/acpi/sbshc.c wait_transaction_complete() calls
>     wait_event_timeout():
> 
>  	if (wait_event_timeout(hc->wait, smb_check_done(hc),
>  			       msecs_to_jiffies(timeout)))
> 
>  2. ___wait_event sets task state to uninterruptible
> 
>  3. ___wait_event calls the "condition" smb_check_done()
> 
>  4. smb_check_done (sbshc.c) calls through to ec_read() in
>     drivers/acpi/ec.c
> 
>  5. ec_guard() is reached, which calls wait_event_timeout():
> 
>  	if (wait_event_timeout(ec->wait,
>  			       ec_transaction_completed(ec),
>  			       guard))
> 
>     i.e. wait_event_timeout() is called again inside the evaluation of
>     the previous wait_event_timeout() condition

Hmm.

This doesn't look quite right to me.

>  6. The EC IRQ handler calls wake_up() and wakes up the sleeping task in
>     ec_guard()
> 
>  7. The task is now in the running state even though the wait "condition"
>     is still being evaluated
> 
>  8. The "condition" check returns false, so ___wait_event calls
>     schedule_timeout()
> 
>  9. Since the task state is running, the scheduler immediately schedules
>     it again
> 
> 10. This loop repeats for around 250 seconds even though the original
>     wait_event_timeout() timeout was only 1000 ms.
> 
>     This happens because each call to schedule_timeout() usually returns
>     immediately, taking less than 1 ms, so the jiffies timeout counter is
>     not decremented. The task is stuck in the running state and so is
>     highly likely to be rescheduled immediately, which takes less than a
>     jiffy.
> 
> The root problem is that wait_event_timeout() does not preserve the task
> state when it is called by a task whose state is not TASK_RUNNING. We fix
> this by preserving and restoring the task state in ___wait_event().
> 
> Signed-off-by: Chris Bainbridge <chris.bainbridge@gmail.com>
> ---
> I am assuming here that wait_event_timeout() is supposed to support reentrant
> calls. If not, perhaps it should BUG_ON when called with a non-running task
> state, and the SBS HC / ACPI EC code needs to be fixed to stop doing this.

OK, so I'm not sure about that too.  Peter?

I'd fix the SBS HC / ACPI EC code to stop doing this regardless.

> If reentrant calls are supposed to work, then this patch will preserve the task
> state (there may be a more appropriate way to support reentrant calls than this
> exact patch, suggestions/alternatives are welcome, but this does work).

Thanks,
Rafael

Chris Bainbridge Nov. 7, 2015, 8:32 a.m. UTC | #2
On Fri, Nov 06, 2015 at 08:44:08PM +0000, Chris Bainbridge wrote:
> -#define ___wait_event(wq, condition, state, exclusive, ret, cmd)	\
> +#define ___wait_event(wq, condition, nstate, exclusive, ret, cmd)	\
>  ({									\
>  	__label__ __out;						\
>  	wait_queue_t __wait;						\
>  	long __ret = ret;	/* explicit shadow */			\
> +	long ostate = current->state;					\

XXX

>  									\
>  	INIT_LIST_HEAD(&__wait.task_list);				\
>  	if (exclusive)							\
> @@ -222,16 +223,16 @@ wait_queue_head_t *bit_waitqueue(void *, int);
>  		__wait.flags = 0;					\
>  									\
>  	for (;;) {							\
> -		long __int = prepare_to_wait_event(&wq, &__wait, state);\
> +		long __int = prepare_to_wait_event(&wq, &__wait, nstate);\
>  									\
>  		if (condition)						\
>  			break;						\
>  									\
> -		if (___wait_is_interruptible(state) && __int) {		\
> +		if (___wait_is_interruptible(nstate) && __int) {	\
>  			__ret = __int;					\
>  			if (exclusive) {				\
>  				abort_exclusive_wait(&wq, &__wait,	\
> -						     state, NULL);	\
> +						     nstate, NULL);	\
>  				goto __out;				\
>  			}						\
>  			break;						\
> @@ -240,6 +241,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
>  		cmd;							\
>  	}								\
>  	finish_wait(&wq, &__wait);					\
> +	set_current_state(ostate);					\

I'm not convinced that this particular code is (or can be) race-free in
the general reentrant case. The outer call to ___wait_event will miss
any wake_up received in the inner call between the point marked XXX
above (the store of current->state) and the restore of the previous
state here. So if the inner condition evaluation or some interrupt
handler happens to trigger a wake_up meant for the outer call, it will
be lost.
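
Roughly, the lost-wakeup window looks like this:

  outer ___wait_event()             inner wait_event_timeout()
  ---------------------             --------------------------
  ostate = RUNNING
  prepare_to_wait_event()
    -> UNINTERRUPTIBLE
  evaluate condition ...........>   ostate = UNINTERRUPTIBLE
                                    prepare_to_wait_event(), sleep
  [wake_up() aimed at the outer
   queue fires here; the waker
   sets the task to RUNNING]
                                    finish_wait()
                                    set_current_state(ostate)
                                      -> UNINTERRUPTIBLE again
  condition returned false
  schedule_timeout() now sleeps for
  the full timeout even though the
  outer wakeup already arrived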

>  __out:	__ret;								\
>  })
Peter Zijlstra Nov. 7, 2015, 9:19 a.m. UTC | #3
On Fri, Nov 06, 2015 at 08:44:08PM +0000, Chris Bainbridge wrote:
> I am assuming here that wait_event_timeout() is supposed to support reentrant
> calls.

Not really. It is sort of allowed, provided the inner one will rarely
block. And therefore the outer one will mostly work.

> If not, perhaps it should BUG_ON when called with a non-running task
> state,

There is a warning in __might_sleep() that catches some of this.

> and the SBS HC / ACPI EC code needs to be fixed to stop doing this.

Yes.

Patch

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 1e1bf9f..a847cf8 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -209,11 +209,12 @@ wait_queue_head_t *bit_waitqueue(void *, int);
  * otherwise.
  */
 
-#define ___wait_event(wq, condition, state, exclusive, ret, cmd)	\
+#define ___wait_event(wq, condition, nstate, exclusive, ret, cmd)	\
 ({									\
 	__label__ __out;						\
 	wait_queue_t __wait;						\
 	long __ret = ret;	/* explicit shadow */			\
+	long ostate = current->state;					\
 									\
 	INIT_LIST_HEAD(&__wait.task_list);				\
 	if (exclusive)							\
@@ -222,16 +223,16 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 		__wait.flags = 0;					\
 									\
 	for (;;) {							\
-		long __int = prepare_to_wait_event(&wq, &__wait, state);\
+		long __int = prepare_to_wait_event(&wq, &__wait, nstate);\
 									\
 		if (condition)						\
 			break;						\
 									\
-		if (___wait_is_interruptible(state) && __int) {		\
+		if (___wait_is_interruptible(nstate) && __int) {	\
 			__ret = __int;					\
 			if (exclusive) {				\
 				abort_exclusive_wait(&wq, &__wait,	\
-						     state, NULL);	\
+						     nstate, NULL);	\
 				goto __out;				\
 			}						\
 			break;						\
@@ -240,6 +241,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 		cmd;							\
 	}								\
 	finish_wait(&wq, &__wait);					\
+	set_current_state(ostate);					\
 __out:	__ret;								\
 })