
sched: Fix data-race in wakeup

Message ID 20201117083016.GK3121392@hirez.programming.kicks-ass.net (mailing list archive)
State New, archived
Series sched: Fix data-race in wakeup

Commit Message

Peter Zijlstra Nov. 17, 2020, 8:30 a.m. UTC
On Mon, Nov 16, 2020 at 07:31:49PM +0000, Mel Gorman wrote:

> And this works.

Yay!

> sched_psi_wake_requeue can probably stay with the other three fields
> given they are under the rq lock but sched_remote_wakeup needs to move
> out.

I _think_ we can move the bit into the unserialized section below.

It's a bit cheeky, but it should work I think because the only time we
actually use this bit, we're guaranteed the task isn't actually running,
so current doesn't exist.

I suppose the question is whether this is worth saving 31 bits over...

How's this?

---
Subject: sched: Fix data-race in wakeup
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue Nov 17 09:08:41 CET 2020

Mel reported that on some ARM64 platforms loadavg goes bananas and
tracked it down to the following data race:

  CPU0					CPU1

  schedule()
    prev->sched_contributes_to_load = X;
    deactivate_task(prev);

					try_to_wake_up()
					  if (p->on_rq &&) // false
					  if (smp_load_acquire(&p->on_cpu) && // true
					      ttwu_queue_wakelist())
					        p->sched_remote_wakeup = Y;

    smp_store_release(prev->on_cpu, 0);

where both p->sched_contributes_to_load and p->sched_remote_wakeup are
in the same word, and thus the stores X and Y race (and can clobber
one another's data).
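
As an illustration of the failure mode, here is a minimal standalone
userspace sketch (plain C and pthreads, not kernel code; the struct and
function names are made up). Both 1-bit fields share one machine word, so
each plain bitfield store is compiled as a read-modify-write of that whole
word, and the two threads can clobber each other's bit:

#include <pthread.h>
#include <stdio.h>

/* Both 1-bit fields occupy the same word, as in task_struct. */
struct flags {
	unsigned contributes_to_load:1;
	unsigned remote_wakeup:1;
};

static struct flags f;

/* Stands in for the schedule()/deactivate_task() side. */
static void *sched_side(void *arg)
{
	f.contributes_to_load = 1;	/* load the word, set bit 0, store the word */
	return NULL;
}

/* Stands in for the remote ttwu_queue_wakelist() side. */
static void *ttwu_side(void *arg)
{
	f.remote_wakeup = 1;		/* load the word, set bit 1, store the word */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, sched_side, NULL);
	pthread_create(&b, NULL, ttwu_side, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/*
	 * If the two read-modify-write sequences interleave, one store
	 * overwrites the other and a bit that was just set reads back
	 * as 0: a lost update of the kind that corrupted
	 * sched_contributes_to_load.
	 */
	printf("contributes_to_load=%u remote_wakeup=%u\n",
	       f.contributes_to_load, f.remote_wakeup);
	return 0;
}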

Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
spinning on p->on_cpu") the p->on_cpu handoff serialized access to
p->sched_remote_wakeup (just as it still does with
p->sched_contributes_to_load), that commit broke that by calling
ttwu_queue_wakelist() with p->on_cpu != 0.
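
The serialization that used to protect the field can be sketched with a
userspace acquire/release analogue (again not the kernel implementation;
names are made up): the waker waits until it observes on_cpu == 0 with
acquire semantics before touching the flags word, so its read-modify-write
is ordered after all of schedule()'s stores:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Userspace analogue of the old p->on_cpu handoff; names are made up. */
struct task {
	unsigned contributes_to_load:1;
	unsigned remote_wakeup:1;
	atomic_int on_cpu;
};

static struct task t = { .on_cpu = 1 };

/* schedule() side: all flag stores happen before the release of on_cpu. */
static void *schedule_side(void *arg)
{
	t.contributes_to_load = 1;	/* plain store into the flags word */
	/* ~ smp_store_release(&prev->on_cpu, 0) */
	atomic_store_explicit(&t.on_cpu, 0, memory_order_release);
	return NULL;
}

/* waker side: wait for on_cpu == 0 with acquire before touching the word. */
static void *waker_side(void *arg)
{
	/* ~ waiting for the on_cpu handoff before queueing the wakeup */
	while (atomic_load_explicit(&t.on_cpu, memory_order_acquire))
		;
	t.remote_wakeup = 1;		/* ordered after schedule_side()'s store */
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, schedule_side, NULL);
	pthread_create(&b, NULL, waker_side, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);

	/* Both bits survive: acquire/release orders the two read-modify-writes. */
	printf("contributes_to_load=%u remote_wakeup=%u\n",
	       t.contributes_to_load, t.remote_wakeup);
	return 0;
}

After commit c6e7bd7afaeb the wakelist path no longer waits for this
handoff (it can queue while p->on_cpu is still 1), which is what lets the
two writes to the same word overlap.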

However, due to

  p->XXX			ttwu()
  schedule()			  if (p->on_rq && ...) // false
    smp_mb__after_spinlock()	  if (smp_load_acquire(&p->on_cpu) &&
    deactivate_task()		      ttwu_queue_wakelist())
      p->on_rq = 0;		        p->sched_remote_wakeup = X;

We can be sure any 'current' store is complete and 'current' is
guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
the current flags word.

Note: while the observed failure was loadavg accounting gone wrong due
to ttwu() clobbering p->sched_contributes_to_load, the reverse problem
is also possible where schedule() clobbers p->sched_remote_wakeup;
this could result in enqueue_entity() wrecking ->vruntime and causing
scheduling artifacts.

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 include/linux/sched.h |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Comments

Will Deacon Nov. 17, 2020, 9:15 a.m. UTC | #1
On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 16, 2020 at 07:31:49PM +0000, Mel Gorman wrote:
> 
> > And this works.
> 
> Yay!
> 
> > sched_psi_wake_requeue can probably stay with the other three fields
> > given they are under the rq lock but sched_remote_wakeup needs to move
> > out.
> 
> I _think_ we can move the bit into the unserialized section below.
> 
> It's a bit cheeky, but it should work I think because the only time we
> actually use this bit, we're guaranteed the task isn't actually running,
> so current doesn't exist.
> 
> I suppose the question is whether this is worth saving 31 bits over...
> 
> How's this?
> 
> ---
> Subject: sched: Fix data-race in wakeup
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Nov 17 09:08:41 CET 2020
> 
> Mel reported that on some ARM64 platforms loadavg goes bananas and
> tracked it down to the following data race:
> 
>   CPU0					CPU1
> 
>   schedule()
>     prev->sched_contributes_to_load = X;
>     deactivate_task(prev);
> 
> 					try_to_wake_up()
> 					  if (p->on_rq &&) // false
> 					  if (smp_load_acquire(&p->on_cpu) && // true
> 					      ttwu_queue_wakelist())
> 					        p->sched_remote_wakeup = Y;
> 
>     smp_store_release(prev->on_cpu, 0);

(nit: I suggested this race over at [1] ;)

> where both p->sched_contributes_to_load and p->sched_remote_wakeup are
> in the same word, and thus the stores X and Y race (and can clobber
> one another's data).
> 
> Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
> spinning on p->on_cpu") the p->on_cpu handoff serialized access to
> p->sched_remote_wakeup (just as it still does with
> p->sched_contributes_to_load), that commit broke that by calling
> ttwu_queue_wakelist() with p->on_cpu != 0.
> 
> However, due to
> 
>   p->XXX			ttwu()
>   schedule()			  if (p->on_rq && ...) // false
>     smp_mb__after_spinlock()	  if (smp_load_acquire(&p->on_cpu) &&
>     deactivate_task()		      ttwu_queue_wakelist())
>       p->on_rq = 0;		        p->sched_remote_wakeup = X;
> 
> We can be sure any 'current' store is complete and 'current' is
> guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
> the current flags word.
> 
> Note: while the observed failure was loadavg accounting gone wrong due
> to ttwu() clobbering p->sched_contributes_to_load, the reverse problem
> is also possible where schedule() clobbers p->sched_remote_wakeup;
> this could result in enqueue_entity() wrecking ->vruntime and causing
> scheduling artifacts.
> 
> Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> Reported-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched.h |   13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -775,7 +775,6 @@ struct task_struct {
>  	unsigned			sched_reset_on_fork:1;
>  	unsigned			sched_contributes_to_load:1;
>  	unsigned			sched_migrated:1;
> -	unsigned			sched_remote_wakeup:1;
>  #ifdef CONFIG_PSI
>  	unsigned			sched_psi_wake_requeue:1;
>  #endif
> @@ -785,6 +784,18 @@ struct task_struct {
>  
>  	/* Unserialized, strictly 'current' */
>  
> +	/*
> +	 * p->in_iowait = 1;		ttwu()
> +	 * schedule()			  if (p->on_rq && ..) // false
> +	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
> +	 *   deactivate_task()		      ttwu_queue_wakelist())
> +	 *     p->on_rq = 0;			p->sched_remote_wakeup = X;
> +	 *
> +	 * Guarantees all stores of 'current' are visible before
> +	 * ->sched_remote_wakeup gets used.

I'm still not sure this is particularly clear -- don't we want to highlight
that the store of p->on_rq is unordered wrt the update to
p->sched_contributes_to_load in deactivate_task()?

I dislike bitfields with a passion, but the fix looks good:

Acked-by: Will Deacon <will@kernel.org>

Now the million dollar question is why KCSAN hasn't run into this. Hrmph.

Will

[1] https://lore.kernel.org/r/20201116131102.GA29992@willie-the-truck
Peter Zijlstra Nov. 17, 2020, 9:29 a.m. UTC | #2
On Tue, Nov 17, 2020 at 09:15:46AM +0000, Will Deacon wrote:
> On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:

> > Subject: sched: Fix data-race in wakeup
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Tue Nov 17 09:08:41 CET 2020
> > 
> > Mel reported that on some ARM64 platforms loadavg goes bananas and
> > tracked it down to the following data race:
> > 
> >   CPU0					CPU1
> > 
> >   schedule()
> >     prev->sched_contributes_to_load = X;
> >     deactivate_task(prev);
> > 
> > 					try_to_wake_up()
> > 					  if (p->on_rq &&) // false
> > 					  if (smp_load_acquire(&p->on_cpu) && // true
> > 					      ttwu_queue_wakelist())
> > 					        p->sched_remote_wakeup = Y;
> > 
> >     smp_store_release(prev->on_cpu, 0);
> 
> (nit: I suggested this race over at [1] ;)

Ah, I'll amend and get you a Debugged-by line or something ;-)

> > where both p->sched_contributes_to_load and p->sched_remote_wakeup are
> > in the same word, and thus the stores X and Y race (and can clobber
> > one another's data).
> > 
> > Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
> > spinning on p->on_cpu") the p->on_cpu handoff serialized access to
> > p->sched_remote_wakeup (just as it still does with
> > p->sched_contributes_to_load), that commit broke that by calling
> > ttwu_queue_wakelist() with p->on_cpu != 0.
> > 
> > However, due to
> > 
> >   p->XXX			ttwu()
> >   schedule()			  if (p->on_rq && ...) // false
> >     smp_mb__after_spinlock()	  if (smp_load_acquire(&p->on_cpu) &&
> >     deactivate_task()		      ttwu_queue_wakelist())
> >       p->on_rq = 0;		        p->sched_remote_wakeup = X;
> > 
> > We can be sure any 'current' store is complete and 'current' is
> > guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
> > the current flags word.
> > 
> > Note: while the observed failure was loadavg accounting gone wrong due
> > to ttwu() clobbering p->sched_contributes_to_load, the reverse problem
> > is also possible where schedule() clobbers p->sched_remote_wakeup;
> > this could result in enqueue_entity() wrecking ->vruntime and causing
> > scheduling artifacts.
> > 
> > Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > Reported-by: Mel Gorman <mgorman@techsingularity.net>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  include/linux/sched.h |   13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> > 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -775,7 +775,6 @@ struct task_struct {
> >  	unsigned			sched_reset_on_fork:1;
> >  	unsigned			sched_contributes_to_load:1;
> >  	unsigned			sched_migrated:1;
> > -	unsigned			sched_remote_wakeup:1;
> >  #ifdef CONFIG_PSI
> >  	unsigned			sched_psi_wake_requeue:1;
> >  #endif
> > @@ -785,6 +784,18 @@ struct task_struct {
> >  
> >  	/* Unserialized, strictly 'current' */
> >  
> > +	/*
> > +	 * p->in_iowait = 1;		ttwu()
> > +	 * schedule()			  if (p->on_rq && ..) // false
> > +	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
> > +	 *   deactivate_task()		      ttwu_queue_wakelist())
> > +	 *     p->on_rq = 0;			p->sched_remote_wakeup = X;
> > +	 *
> > +	 * Guarantees all stores of 'current' are visible before
> > +	 * ->sched_remote_wakeup gets used.
> 
> I'm still not sure this is particularly clear -- don't we want to highlight
> that the store of p->on_rq is unordered wrt the update to
> p->sched_contributes_to_load in deactivate_task()?

I can explicitly call that out I suppose.

> I dislike bitfields with a passion, but the fix looks good:

I don't particularly hate them; they're just a flag field with names on
(in this case).

> Acked-by: Will Deacon <will@kernel.org>

Thanks!

> Now the million dollar question is why KCSAN hasn't run into this. Hrmph.

kernel/sched/Makefile:KCSAN_SANITIZE := n

might have something to do with that, I suppose.
Peter Zijlstra Nov. 17, 2020, 9:46 a.m. UTC | #3
On Tue, Nov 17, 2020 at 10:29:36AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 09:15:46AM +0000, Will Deacon wrote:
> > On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> > >  	/* Unserialized, strictly 'current' */
> > >  
> > > +	/*
> > > +	 * p->in_iowait = 1;		ttwu()
> > > +	 * schedule()			  if (p->on_rq && ..) // false
> > > +	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
> > > +	 *   deactivate_task()		      ttwu_queue_wakelist())
> > > +	 *     p->on_rq = 0;			p->sched_remote_wakeup = X;
> > > +	 *
> > > +	 * Guarantees all stores of 'current' are visible before
> > > +	 * ->sched_remote_wakeup gets used.
> > 
> > I'm still not sure this is particularly clear -- don't we want to highlight
> > that the store of p->on_rq is unordered wrt the update to
> > p->sched_contributes_to_load in deactivate_task()?

How's this then? It still doesn't explicitly call out the specific race,
but does mention the more fundamental issue that wakelist queueing
doesn't respect the regular rules anymore.

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,6 @@ struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
-	unsigned			sched_remote_wakeup:1;
 #ifdef CONFIG_PSI
 	unsigned			sched_psi_wake_requeue:1;
 #endif
@@ -785,6 +784,21 @@ struct task_struct {
 
 	/* Unserialized, strictly 'current' */
 
+	/*
+	 * This field must not be in the scheduler word above due to wakelist
+	 * queueing no longer being serialized by p->on_cpu. However:
+	 *
+	 * p->XXX = X;			ttwu()
+	 * schedule()			  if (p->on_rq && ..) // false
+	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
+	 *   deactivate_task()		      ttwu_queue_wakelist())
+	 *     p->on_rq = 0;			p->sched_remote_wakeup = Y;
+	 *
+	 * guarantees all stores of 'current' are visible before
+	 * ->sched_remote_wakeup gets used, so it can be in this word.
+	 */
+	unsigned			sched_remote_wakeup:1;
+
 	/* Bit to tell LSMs we're in execve(): */
 	unsigned			in_execve:1;
 	unsigned			in_iowait:1;
Will Deacon Nov. 17, 2020, 10:36 a.m. UTC | #4
On Tue, Nov 17, 2020 at 10:46:21AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 10:29:36AM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 17, 2020 at 09:15:46AM +0000, Will Deacon wrote:
> > > On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> > > >  	/* Unserialized, strictly 'current' */
> > > >  
> > > > +	/*
> > > > +	 * p->in_iowait = 1;		ttwu()
> > > > +	 * schedule()			  if (p->on_rq && ..) // false
> > > > +	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
> > > > +	 *   deactivate_task()		      ttwu_queue_wakelist())
> > > > +	 *     p->on_rq = 0;			p->sched_remote_wakeup = X;
> > > > +	 *
> > > > +	 * Guarantees all stores of 'current' are visible before
> > > > +	 * ->sched_remote_wakeup gets used.
> > > 
> > > I'm still not sure this is particularly clear -- don't we want to highlight
> > > that the store of p->on_rq is unordered wrt the update to
> > > p->sched_contributes_to_load in deactivate_task()?
> 
> How's this then? It still doesn't explicitly call out the specific race,
> but does mention the more fundamental issue that wakelist queueing
> doesn't respect the regular rules anymore.
> 
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -775,7 +775,6 @@ struct task_struct {
>  	unsigned			sched_reset_on_fork:1;
>  	unsigned			sched_contributes_to_load:1;
>  	unsigned			sched_migrated:1;
> -	unsigned			sched_remote_wakeup:1;
>  #ifdef CONFIG_PSI
>  	unsigned			sched_psi_wake_requeue:1;
>  #endif
> @@ -785,6 +784,21 @@ struct task_struct {
>  
>  	/* Unserialized, strictly 'current' */
>  
> +	/*
> +	 * This field must not be in the scheduler word above due to wakelist
> +	 * queueing no longer being serialized by p->on_cpu. However:
> +	 *
> +	 * p->XXX = X;			ttwu()
> +	 * schedule()			  if (p->on_rq && ..) // false
> +	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
> +	 *   deactivate_task()		      ttwu_queue_wakelist())
> +	 *     p->on_rq = 0;			p->sched_remote_wakeup = Y;
> +	 *
> +	 * guarantees all stores of 'current' are visible before
> +	 * ->sched_remote_wakeup gets used, so it can be in this word.
> +	 */
> +	unsigned			sched_remote_wakeup:1;

Much better, thanks!

Will
Mel Gorman Nov. 17, 2020, 12:40 p.m. UTC | #5
On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> > sched_psi_wake_requeue can probably stay with the other three fields
> > given they are under the rq lock but sched_remote_wakeup needs to move
> > out.
> 
> I _think_ we can move the bit into the unserialized section below.
> 
> It's a bit cheeky, but it should work I think because the only time we
> actually use this bit, we're guaranteed the task isn't actually running,
> so current doesn't exist.
> 

Putting the bit there has the added advantage that, if the bit existed
on its own, it would be very special in terms of how it should be
treated. Adding a bit adjacent to it would be potentially hazardous.

> ---
> Subject: sched: Fix data-race in wakeup
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Nov 17 09:08:41 CET 2020
>  
> <SNIP>
> 
> Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> Reported-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks, testing completed successfully! With the suggested alternative
comment above sched_remote_wakeup:

Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Valentin Schneider Nov. 17, 2020, 12:52 p.m. UTC | #6
On 17/11/20 09:46, Peter Zijlstra wrote:
> How's this then? It still doesn't explicitly call out the specific race,
> but does mention the more fundamental issue that wakelist queueing
> doesn't respect the regular rules anymore.
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -775,7 +775,6 @@ struct task_struct {
>  	unsigned			sched_reset_on_fork:1;
>  	unsigned			sched_contributes_to_load:1;
>  	unsigned			sched_migrated:1;
> -	unsigned			sched_remote_wakeup:1;
>  #ifdef CONFIG_PSI
>  	unsigned			sched_psi_wake_requeue:1;
>  #endif
> @@ -785,6 +784,21 @@ struct task_struct {
>  
>  	/* Unserialized, strictly 'current' */
>  
> +	/*
> +	 * This field must not be in the scheduler word above due to wakelist
> +	 * queueing no longer being serialized by p->on_cpu. However:
> +	 *
> +	 * p->XXX = X;			ttwu()
> +	 * schedule()			  if (p->on_rq && ..) // false
> +	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
> +	 *   deactivate_task()		      ttwu_queue_wakelist())
> +	 *     p->on_rq = 0;			p->sched_remote_wakeup = Y;
> +	 *
> +	 * guarantees all stores of 'current' are visible before
> +	 * ->sched_remote_wakeup gets used, so it can be in this word.
> +	 */

Isn't the control dep between that ttwu() p->on_rq read and
p->sched_remote_wakeup write "sufficient"? That should be giving the right
ordering for the rest of ttwu() wrt. those 'current' bits, considering they
are written before that smp_mb__after_spinlock().

In any case, consider me convinced:

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

> +	unsigned			sched_remote_wakeup:1;
> +
>  	/* Bit to tell LSMs we're in execve(): */
>  	unsigned			in_execve:1;
>  	unsigned			in_iowait:1;
Valentin Schneider Nov. 17, 2020, 3:37 p.m. UTC | #7
On 17/11/20 12:52, Valentin Schneider wrote:
> On 17/11/20 09:46, Peter Zijlstra wrote:
>> How's this then? It still doesn't explicitly call out the specific race,
>> but does mention the more fundamental issue that wakelist queueing
>> doesn't respect the regular rules anymore.
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -775,7 +775,6 @@ struct task_struct {
>>  	unsigned			sched_reset_on_fork:1;
>>  	unsigned			sched_contributes_to_load:1;
>>  	unsigned			sched_migrated:1;
>> -	unsigned			sched_remote_wakeup:1;
>>  #ifdef CONFIG_PSI
>>  	unsigned			sched_psi_wake_requeue:1;
>>  #endif
>> @@ -785,6 +784,21 @@ struct task_struct {
>>  
>>  	/* Unserialized, strictly 'current' */
>>  
>> +	/*
>> +	 * This field must not be in the scheduler word above due to wakelist
>> +	 * queueing no longer being serialized by p->on_cpu. However:
>> +	 *
>> +	 * p->XXX = X;			ttwu()
>> +	 * schedule()			  if (p->on_rq && ..) // false
>> +	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
>> +	 *   deactivate_task()		      ttwu_queue_wakelist())
>> +	 *     p->on_rq = 0;			p->sched_remote_wakeup = Y;
>> +	 *
>> +	 * guarantees all stores of 'current' are visible before
>> +	 * ->sched_remote_wakeup gets used, so it can be in this word.
>> +	 */
>
> Isn't the control dep between that ttwu() p->on_rq read and
> p->sched_remote_wakeup write "sufficient"?

smp_acquire__after_ctrl_dep() that is, since we need
  ->on_rq load => 'current' bits load + store

> That should be giving the right
> ordering for the rest of ttwu() wrt. those 'current' bits, considering they
> are written before that smp_mb__after_spinlock().
>
> In any case, consider me convinced:
>
> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
>
>> +	unsigned			sched_remote_wakeup:1;
>> +
>>  	/* Bit to tell LSMs we're in execve(): */
>>  	unsigned			in_execve:1;
>>  	unsigned			in_iowait:1;
Peter Zijlstra Nov. 17, 2020, 4:13 p.m. UTC | #8
On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote:

> >> +	/*
> >> +	 * This field must not be in the scheduler word above due to wakelist
> >> +	 * queueing no longer being serialized by p->on_cpu. However:
> >> +	 *
> >> +	 * p->XXX = X;			ttwu()
> >> +	 * schedule()			  if (p->on_rq && ..) // false
> >> +	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
> >> +	 *   deactivate_task()		      ttwu_queue_wakelist())
> >> +	 *     p->on_rq = 0;			p->sched_remote_wakeup = Y;
> >> +	 *
> >> +	 * guarantees all stores of 'current' are visible before
> >> +	 * ->sched_remote_wakeup gets used, so it can be in this word.
> >> +	 */
> >
> > Isn't the control dep between that ttwu() p->on_rq read and
> > p->sched_remote_wakeup write "sufficient"?
> 
> smp_acquire__after_ctrl_dep() that is, since we need
>   ->on_rq load => 'current' bits load + store

I don't think we need that extra barrier; after all, there will be a
complete schedule() between waking the task and it actually becoming
current.
Valentin Schneider Nov. 17, 2020, 7:32 p.m. UTC | #9
On 17/11/20 16:13, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote:
>
>> >> +	/*
>> >> +	 * This field must not be in the scheduler word above due to wakelist
>> >> +	 * queueing no longer being serialized by p->on_cpu. However:
>> >> +	 *
>> >> +	 * p->XXX = X;			ttwu()
>> >> +	 * schedule()			  if (p->on_rq && ..) // false
>> >> +	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
>> >> +	 *   deactivate_task()		      ttwu_queue_wakelist())
>> >> +	 *     p->on_rq = 0;			p->sched_remote_wakeup = Y;
>> >> +	 *
>> >> +	 * guarantees all stores of 'current' are visible before
>> >> +	 * ->sched_remote_wakeup gets used, so it can be in this word.
>> >> +	 */
>> >
>> > Isn't the control dep between that ttwu() p->on_rq read and
>> > p->sched_remote_wakeup write "sufficient"?
>> 
>> smp_acquire__after_ctrl_dep() that is, since we need
>>   ->on_rq load => 'current' bits load + store
>
> I don't think we need that extra barrier; after all, there will be a
> complete schedule() between waking the task and it actually becoming
> current.

Apologies for the messy train of thought; what I was trying to say is that
we already have the following, which AIUI is sufficient:

	* p->XXX = X;			ttwu()
	* schedule()			  if (p->on_rq && ..) // false
	*   smp_mb__after_spinlock();	  smp_acquire__after_ctrl_dep();
	*   deactivate_task()		  ttwu_queue_wakelist()
	*     p->on_rq = 0;		    p->sched_remote_wakeup = Y;
Peter Zijlstra Nov. 18, 2020, 8:05 a.m. UTC | #10
On Tue, Nov 17, 2020 at 07:32:16PM +0000, Valentin Schneider wrote:
> 
> On 17/11/20 16:13, Peter Zijlstra wrote:
> > On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote:
> >
> >> >> +	/*
> >> >> +	 * This field must not be in the scheduler word above due to wakelist
> >> >> +	 * queueing no longer being serialized by p->on_cpu. However:
> >> >> +	 *
> >> >> +	 * p->XXX = X;			ttwu()
> >> >> +	 * schedule()			  if (p->on_rq && ..) // false
> >> >> +	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
> >> >> +	 *   deactivate_task()		      ttwu_queue_wakelist())
> >> >> +	 *     p->on_rq = 0;			p->sched_remote_wakeup = Y;
> >> >> +	 *
> >> >> +	 * guarantees all stores of 'current' are visible before
> >> >> +	 * ->sched_remote_wakeup gets used, so it can be in this word.
> >> >> +	 */
> >> >
> >> > Isn't the control dep between that ttwu() p->on_rq read and
> >> > p->sched_remote_wakeup write "sufficient"?
> >> 
> >> smp_acquire__after_ctrl_dep() that is, since we need
> >>   ->on_rq load => 'current' bits load + store
> >
> > I don't think we need that extra barrier; after all, there will be a
> > complete schedule() between waking the task and it actually becoming
> > current.
> 
> Apologies for the messy train of thought; what I was trying to say is that
> we already have the following, which AIUI is sufficient:
> 
> 	* p->XXX = X;			ttwu()
> 	* schedule()			  if (p->on_rq && ..) // false
> 	*   smp_mb__after_spinlock();	  smp_acquire__after_ctrl_dep();
> 	*   deactivate_task()		  ttwu_queue_wakelist()
> 	*     p->on_rq = 0;		    p->sched_remote_wakeup = Y;
> 

Ah, you meant the existing smp_acquire__after_ctrl_dep(). Yeah, that's
not required here either ;-)

The reason I had the ->on_cpu thing in there is because it shows we
violate the regular ->on_cpu handoff rules, not for the acquire.

The only ordering that matters on the RHS of that thing is the ->on_rq
load to p->sched_remote_wakeup store ctrl dep. That, combined with the
LHS, guarantees there is a strict order on the stores.

Makes sense?
Valentin Schneider Nov. 18, 2020, 9:51 a.m. UTC | #11
On 18/11/20 08:05, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 07:32:16PM +0000, Valentin Schneider wrote:
>> 
>> On 17/11/20 16:13, Peter Zijlstra wrote:
>> > On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote:
>> >
>> >> >> +	/*
>> >> >> +	 * This field must not be in the scheduler word above due to wakelist
>> >> >> +	 * queueing no longer being serialized by p->on_cpu. However:
>> >> >> +	 *
>> >> >> +	 * p->XXX = X;			ttwu()
>> >> >> +	 * schedule()			  if (p->on_rq && ..) // false
>> >> >> +	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
>> >> >> +	 *   deactivate_task()		      ttwu_queue_wakelist())
>> >> >> +	 *     p->on_rq = 0;			p->sched_remote_wakeup = Y;
>> >> >> +	 *
>> >> >> +	 * guarantees all stores of 'current' are visible before
>> >> >> +	 * ->sched_remote_wakeup gets used, so it can be in this word.
>> >> >> +	 */
>> >> >
>> >> > Isn't the control dep between that ttwu() p->on_rq read and
>> >> > p->sched_remote_wakeup write "sufficient"?
>> >> 
>> >> smp_acquire__after_ctrl_dep() that is, since we need
>> >>   ->on_rq load => 'current' bits load + store
>> >
>> > I don't think we need that extra barrier; after all, there will be a
>> > complete schedule() between waking the task and it actually becoming
>> > current.
>> 
>> Apologies for the messy train of thought; what I was trying to say is that
>> we already have the following, which AIUI is sufficient:
>> 
>> 	* p->XXX = X;			ttwu()
>> 	* schedule()			  if (p->on_rq && ..) // false
>> 	*   smp_mb__after_spinlock();	  smp_acquire__after_ctrl_dep();
>> 	*   deactivate_task()		  ttwu_queue_wakelist()
>> 	*     p->on_rq = 0;		    p->sched_remote_wakeup = Y;
>> 
>
> Ah, you meant the existing smp_acquire__after_ctrl_dep(). Yeah, that's
> not required here either ;-)
>
> The reason I had the ->on_cpu thing in there is because it shows we
> violate the regular ->on_cpu handoff rules, not for the acquire.
>

Gotcha

> The only ordering that matters on the RHS of that thing is the ->on_rq
> load to p->sched_remote_wakeup store ctrl dep. That, combined with the
> LHS, guarantees there is a strict order on the stores.
>
> Makes sense?

Yep, thanks!
Marco Elver Nov. 18, 2020, 1:33 p.m. UTC | #12
On Tue, Nov 17, 2020 at 10:29AM +0100, Peter Zijlstra wrote:
[...]
> > Now the million dollar question is why KCSAN hasn't run into this. Hrmph.
> 
> kernel/sched/Makefile:KCSAN_SANITIZE := n
> 
> might have something to do with that, I suppose.

For the record, I tried to reproduce this data race. I found a
read/write race on this bitfield, but not yet that write/write race
(perhaps I wasn't running the right workload).

	| read to 0xffff8d4e2ce39aac of 1 bytes by task 5269 on cpu 3:
	|  __sched_setscheduler+0x4a9/0x1070 kernel/sched/core.c:5297
	|  sched_setattr kernel/sched/core.c:5512 [inline]
	|  ...
	|
	| write to 0xffff8d4e2ce39aac of 1 bytes by task 5268 on cpu 1:
	|  __schedule+0x296/0xab0 kernel/sched/core.c:4462 		prev->sched_contributes_to_load =
	|  schedule+0xd1/0x130 kernel/sched/core.c:4601
	|  ...
	|
	| Full report: https://paste.debian.net/hidden/07a50732/

Getting to the above race also required some effort, as 1) I kept hitting
other unrelated data races in the scheduler and had to silence those
first to be able to make progress, and 2) I had to enable KCSAN for
scheduler code only, so that all other data races were ignored. Then I
let syzkaller run for a few minutes.

Also note our default KCSAN config is suboptimal. For serious debugging,
I'd recommend the same config that rcutorture uses with the --kcsan
flag, specifically:

	CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n,
	CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n

to get the full picture.

However, as a first step, it'd be nice to eventually remove the
KCSAN_SANITIZE := n from kernel/sched/Makefile when things are less
noisy (so that syzbot and default builds can start finding more serious
issues, too).

Thanks,
-- Marco

Patch

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,6 @@  struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
-	unsigned			sched_remote_wakeup:1;
 #ifdef CONFIG_PSI
 	unsigned			sched_psi_wake_requeue:1;
 #endif
@@ -785,6 +784,18 @@  struct task_struct {
 
 	/* Unserialized, strictly 'current' */
 
+	/*
+	 * p->in_iowait = 1;		ttwu()
+	 * schedule()			  if (p->on_rq && ..) // false
+	 *   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true
+	 *   deactivate_task()		      ttwu_queue_wakelist())
+	 *     p->on_rq = 0;			p->sched_remote_wakeup = X;
+	 *
+	 * Guarantees all stores of 'current' are visible before
+	 * ->sched_remote_wakeup gets used.
+	 */
+	unsigned			sched_remote_wakeup:1;
+
 	/* Bit to tell LSMs we're in execve(): */
 	unsigned			in_execve:1;
 	unsigned			in_iowait:1;