Message ID | 20201117083016.GK3121392@hirez.programming.kicks-ass.net
---|---
State | New, archived
Series | sched: Fix data-race in wakeup
On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 16, 2020 at 07:31:49PM +0000, Mel Gorman wrote:
>
> > And this works.
>
> Yay!
>
> > sched_psi_wake_requeue can probably stay with the other three fields
> > given they are under the rq lock but sched_remote_wakeup needs to move
> > out.
>
> I _think_ we can move the bit into the unserialized section below.
>
> It's a bit cheeky, but it should work I think because the only time we
> actually use this bit, we're guaranteed the task isn't actually running,
> so current doesn't exist.
>
> I suppose the question is whether this is worth saving 31 bits over...
>
> How's this?
>
> ---
> Subject: sched: Fix data-race in wakeup
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Nov 17 09:08:41 CET 2020
>
> Mel reported that on some ARM64 platforms loadavg goes bananas and
> tracked it down to the following data race:
>
>   CPU0                                  CPU1
>
>   schedule()
>     prev->sched_contributes_to_load = X;
>     deactivate_task(prev);
>
>                                         try_to_wake_up()
>                                           if (p->on_rq &&) // false
>                                           if (smp_load_acquire(&p->on_cpu) && // true
>                                               ttwu_queue_wakelist())
>                                             p->sched_remote_wakeup = Y;
>
>   smp_store_release(prev->on_cpu, 0);

(nit: I suggested this race over at [1] ;)

> where both p->sched_contributes_to_load and p->sched_remote_wakeup are
> in the same word, and thus the stores X and Y race (and can clobber
> one another's data).
>
> Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
> spinning on p->on_cpu") the p->on_cpu handoff serialized access to
> p->sched_remote_wakeup (just as it still does with
> p->sched_contributes_to_load), that commit broke that by calling
> ttwu_queue_wakelist() with p->on_cpu != 0.
>
> However, due to
>
>   p->XXX                                ttwu()
>   schedule()                              if (p->on_rq && ...) // false
>     smp_mb__after_spinlock()              if (smp_load_acquire(&p->on_cpu) &&
>     deactivate_task()                         ttwu_queue_wakelist())
>       p->on_rq = 0;                         p->sched_remote_wakeup = X;
>
> we can be sure any 'current' store is complete and 'current' is
> guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
> the current flags word.
>
> Note: while the observed failure was loadavg accounting gone wrong due
> to ttwu() clobbering p->sched_contributes_to_load, the reverse problem
> is also possible where schedule() clobbers p->sched_remote_wakeup;
> this could result in enqueue_entity() wrecking ->vruntime and causing
> scheduling artifacts.
>
> Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> Reported-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  include/linux/sched.h |   13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -775,7 +775,6 @@ struct task_struct {
>          unsigned            sched_reset_on_fork:1;
>          unsigned            sched_contributes_to_load:1;
>          unsigned            sched_migrated:1;
> -        unsigned            sched_remote_wakeup:1;
>  #ifdef CONFIG_PSI
>          unsigned            sched_psi_wake_requeue:1;
>  #endif
> @@ -785,6 +784,18 @@ struct task_struct {
>
>          /* Unserialized, strictly 'current' */
>
> +        /*
> +         * p->in_iowait = 1;                  ttwu()
> +         * schedule()                           if (p->on_rq && ..) // false
> +         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
> +         *   deactivate_task()                      ttwu_queue_wakelist())
> +         *     p->on_rq = 0;                      p->sched_remote_wakeup = X;
> +         *
> +         * Guarantees all stores of 'current' are visible before
> +         * ->sched_remote_wakeup gets used.

I'm still not sure this is particularly clear -- don't we want to highlight
that the store of p->on_rq is unordered wrt the update to
p->sched_contributes_to_load() in deactivate_task()?

I dislike bitfields with a passion, but the fix looks good:

Acked-by: Will Deacon <will@kernel.org>

Now the million dollar question is why KCSAN hasn't run into this. Hrmph.

Will

[1] https://lore.kernel.org/r/20201116131102.GA29992@willie-the-truck
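[The clobbering mechanism being discussed is easy to reproduce in userspace, because a one-bit bitfield store compiles to a read-modify-write of the containing word. The sketch below is not kernel code and the field names are invented; it only demonstrates the lost-update effect that the patch avoids by moving sched_remote_wakeup out of the rq-serialized word. Strictly speaking the program contains a C data race, which is the point; whether a given run observes lost updates depends on compiler and hardware.]

/*
 * Illustrative userspace sketch only (not kernel code): two threads
 * writing adjacent one-bit bitfields in the same word. Each bitfield
 * store is a read-modify-write of the whole word, so unserialized
 * writers can clobber each other's bits -- the same effect as the
 * sched_contributes_to_load / sched_remote_wakeup race above.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct flags {
	unsigned contributes_to_load:1;	/* stand-in names, not the real fields */
	unsigned remote_wakeup:1;
};

static volatile struct flags f;		/* volatile only to keep the demo honest */
static volatile bool stop;

/* Writes its own bit and immediately checks whether the write survived. */
static void *victim(void *arg)
{
	unsigned long lost = 0;

	for (long i = 0; i < 20000000; i++) {
		unsigned want = i & 1;

		f.contributes_to_load = want;		/* load word, set bit, store word */
		if (f.contributes_to_load != want)	/* clobbered by the other writer's RMW */
			lost++;
	}
	printf("lost updates seen by victim: %lu\n", lost);
	stop = true;
	return NULL;
}

/* Keeps toggling the adjacent bit; each store is also a word-sized RMW. */
static void *toggler(void *arg)
{
	for (long i = 0; !stop; i++)
		f.remote_wakeup = i & 1;
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, victim, NULL);
	pthread_create(&b, NULL, toggler, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

[Built with something like cc -O2 -pthread, runs on typical SMP hardware tend to report a nonzero count, though any particular run may not.]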
On Tue, Nov 17, 2020 at 09:15:46AM +0000, Will Deacon wrote:
> On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> > Subject: sched: Fix data-race in wakeup
> > From: Peter Zijlstra <peterz@infradead.org>
> > Date: Tue Nov 17 09:08:41 CET 2020
> >
> > Mel reported that on some ARM64 platforms loadavg goes bananas and
> > tracked it down to the following data race:
> >
> >   CPU0                                  CPU1
> >
> >   schedule()
> >     prev->sched_contributes_to_load = X;
> >     deactivate_task(prev);
> >
> >                                         try_to_wake_up()
> >                                           if (p->on_rq &&) // false
> >                                           if (smp_load_acquire(&p->on_cpu) && // true
> >                                               ttwu_queue_wakelist())
> >                                             p->sched_remote_wakeup = Y;
> >
> >   smp_store_release(prev->on_cpu, 0);
>
> (nit: I suggested this race over at [1] ;)

Ah, I'll amend and get you a Debugged-by line or something ;-)

> > where both p->sched_contributes_to_load and p->sched_remote_wakeup are
> > in the same word, and thus the stores X and Y race (and can clobber
> > one another's data).
> >
> > Whereas prior to commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
> > spinning on p->on_cpu") the p->on_cpu handoff serialized access to
> > p->sched_remote_wakeup (just as it still does with
> > p->sched_contributes_to_load), that commit broke that by calling
> > ttwu_queue_wakelist() with p->on_cpu != 0.
> >
> > However, due to
> >
> >   p->XXX                                ttwu()
> >   schedule()                              if (p->on_rq && ...) // false
> >     smp_mb__after_spinlock()              if (smp_load_acquire(&p->on_cpu) &&
> >     deactivate_task()                         ttwu_queue_wakelist())
> >       p->on_rq = 0;                         p->sched_remote_wakeup = X;
> >
> > we can be sure any 'current' store is complete and 'current' is
> > guaranteed asleep. Therefore we can move p->sched_remote_wakeup into
> > the current flags word.
> >
> > Note: while the observed failure was loadavg accounting gone wrong due
> > to ttwu() clobbering p->sched_contributes_to_load, the reverse problem
> > is also possible where schedule() clobbers p->sched_remote_wakeup;
> > this could result in enqueue_entity() wrecking ->vruntime and causing
> > scheduling artifacts.
> >
> > Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> > Reported-by: Mel Gorman <mgorman@techsingularity.net>
> > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> > ---
> >  include/linux/sched.h |   13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> >
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -775,7 +775,6 @@ struct task_struct {
> >          unsigned            sched_reset_on_fork:1;
> >          unsigned            sched_contributes_to_load:1;
> >          unsigned            sched_migrated:1;
> > -        unsigned            sched_remote_wakeup:1;
> >  #ifdef CONFIG_PSI
> >          unsigned            sched_psi_wake_requeue:1;
> >  #endif
> > @@ -785,6 +784,18 @@ struct task_struct {
> >
> >          /* Unserialized, strictly 'current' */
> >
> > +        /*
> > +         * p->in_iowait = 1;                  ttwu()
> > +         * schedule()                           if (p->on_rq && ..) // false
> > +         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
> > +         *   deactivate_task()                      ttwu_queue_wakelist())
> > +         *     p->on_rq = 0;                      p->sched_remote_wakeup = X;
> > +         *
> > +         * Guarantees all stores of 'current' are visible before
> > +         * ->sched_remote_wakeup gets used.
>
> I'm still not sure this is particularly clear -- don't we want to highlight
> that the store of p->on_rq is unordered wrt the update to
> p->sched_contributes_to_load() in deactivate_task()?

I can explicitly call that out I suppose.

> I dislike bitfields with a passion, but the fix looks good:

I don't particularly hate them, they're just a flag field with names on
(in this case).

> Acked-by: Will Deacon <will@kernel.org>

Thanks!

> Now the million dollar question is why KCSAN hasn't run into this. Hrmph.

kernel/sched/Makefile:KCSAN_SANITIZE := n

might have something to do with that, I suppose.
On Tue, Nov 17, 2020 at 10:29:36AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 09:15:46AM +0000, Will Deacon wrote:
> > On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> > >          /* Unserialized, strictly 'current' */
> > >
> > > +        /*
> > > +         * p->in_iowait = 1;                  ttwu()
> > > +         * schedule()                           if (p->on_rq && ..) // false
> > > +         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
> > > +         *   deactivate_task()                      ttwu_queue_wakelist())
> > > +         *     p->on_rq = 0;                      p->sched_remote_wakeup = X;
> > > +         *
> > > +         * Guarantees all stores of 'current' are visible before
> > > +         * ->sched_remote_wakeup gets used.
> >
> > I'm still not sure this is particularly clear -- don't we want to highlight
> > that the store of p->on_rq is unordered wrt the update to
> > p->sched_contributes_to_load() in deactivate_task()?

How's this then? It still doesn't explicitly call out the specific race,
but does mention the more fundamental issue that wakelist queueing
doesn't respect the regular rules anymore.

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,6 @@ struct task_struct {
         unsigned            sched_reset_on_fork:1;
         unsigned            sched_contributes_to_load:1;
         unsigned            sched_migrated:1;
-        unsigned            sched_remote_wakeup:1;
 #ifdef CONFIG_PSI
         unsigned            sched_psi_wake_requeue:1;
 #endif
@@ -785,6 +784,21 @@ struct task_struct {

         /* Unserialized, strictly 'current' */

+        /*
+         * This field must not be in the scheduler word above due to wakelist
+         * queueing no longer being serialized by p->on_cpu. However:
+         *
+         * p->XXX = X;                        ttwu()
+         * schedule()                           if (p->on_rq && ..) // false
+         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
+         *   deactivate_task()                      ttwu_queue_wakelist())
+         *     p->on_rq = 0;                      p->sched_remote_wakeup = Y;
+         *
+         * guarantees all stores of 'current' are visible before
+         * ->sched_remote_wakeup gets used, so it can be in this word.
+         */
+        unsigned            sched_remote_wakeup:1;
+
         /* Bit to tell LSMs we're in execve(): */
         unsigned            in_execve:1;
         unsigned            in_iowait:1;
On Tue, Nov 17, 2020 at 10:46:21AM +0100, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 10:29:36AM +0100, Peter Zijlstra wrote:
> > On Tue, Nov 17, 2020 at 09:15:46AM +0000, Will Deacon wrote:
> > > On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> > > >          /* Unserialized, strictly 'current' */
> > > >
> > > > +        /*
> > > > +         * p->in_iowait = 1;                  ttwu()
> > > > +         * schedule()                           if (p->on_rq && ..) // false
> > > > +         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
> > > > +         *   deactivate_task()                      ttwu_queue_wakelist())
> > > > +         *     p->on_rq = 0;                      p->sched_remote_wakeup = X;
> > > > +         *
> > > > +         * Guarantees all stores of 'current' are visible before
> > > > +         * ->sched_remote_wakeup gets used.
> > >
> > > I'm still not sure this is particularly clear -- don't we want to highlight
> > > that the store of p->on_rq is unordered wrt the update to
> > > p->sched_contributes_to_load() in deactivate_task()?
>
> How's this then? It still doesn't explicitly call out the specific race,
> but does mention the more fundamental issue that wakelist queueing
> doesn't respect the regular rules anymore.
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -775,7 +775,6 @@ struct task_struct {
>          unsigned            sched_reset_on_fork:1;
>          unsigned            sched_contributes_to_load:1;
>          unsigned            sched_migrated:1;
> -        unsigned            sched_remote_wakeup:1;
>  #ifdef CONFIG_PSI
>          unsigned            sched_psi_wake_requeue:1;
>  #endif
> @@ -785,6 +784,21 @@ struct task_struct {
>
>          /* Unserialized, strictly 'current' */
>
> +        /*
> +         * This field must not be in the scheduler word above due to wakelist
> +         * queueing no longer being serialized by p->on_cpu. However:
> +         *
> +         * p->XXX = X;                        ttwu()
> +         * schedule()                           if (p->on_rq && ..) // false
> +         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
> +         *   deactivate_task()                      ttwu_queue_wakelist())
> +         *     p->on_rq = 0;                      p->sched_remote_wakeup = Y;
> +         *
> +         * guarantees all stores of 'current' are visible before
> +         * ->sched_remote_wakeup gets used, so it can be in this word.
> +         */
> +        unsigned            sched_remote_wakeup:1;

Much better, thanks!

Will
On Tue, Nov 17, 2020 at 09:30:16AM +0100, Peter Zijlstra wrote:
> > sched_psi_wake_requeue can probably stay with the other three fields
> > given they are under the rq lock but sched_remote_wakeup needs to move
> > out.
>
> I _think_ we can move the bit into the unserialized section below.
>
> It's a bit cheeky, but it should work I think because the only time we
> actually use this bit, we're guaranteed the task isn't actually running,
> so current doesn't exist.
>

Putting the bit there has the added advantage that if the bit existed on
its own it would be very special in terms of how it should be treated.
Adding a bit adjacent to it would be potentially hazardous.

> ---
> Subject: sched: Fix data-race in wakeup
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Tue Nov 17 09:08:41 CET 2020
>
> <SNIP>
>
> Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
> Reported-by: Mel Gorman <mgorman@techsingularity.net>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks, testing completed successfully! With the suggested alternative
comment above sched_remote_wakeup:

Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
On 17/11/20 09:46, Peter Zijlstra wrote:
> How's this then? It still doesn't explicitly call out the specific race,
> but does mention the more fundamental issue that wakelist queueing
> doesn't respect the regular rules anymore.
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -775,7 +775,6 @@ struct task_struct {
>          unsigned            sched_reset_on_fork:1;
>          unsigned            sched_contributes_to_load:1;
>          unsigned            sched_migrated:1;
> -        unsigned            sched_remote_wakeup:1;
>  #ifdef CONFIG_PSI
>          unsigned            sched_psi_wake_requeue:1;
>  #endif
> @@ -785,6 +784,21 @@ struct task_struct {
>
>          /* Unserialized, strictly 'current' */
>
> +        /*
> +         * This field must not be in the scheduler word above due to wakelist
> +         * queueing no longer being serialized by p->on_cpu. However:
> +         *
> +         * p->XXX = X;                        ttwu()
> +         * schedule()                           if (p->on_rq && ..) // false
> +         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
> +         *   deactivate_task()                      ttwu_queue_wakelist())
> +         *     p->on_rq = 0;                      p->sched_remote_wakeup = Y;
> +         *
> +         * guarantees all stores of 'current' are visible before
> +         * ->sched_remote_wakeup gets used, so it can be in this word.
> +         */

Isn't the control dep between that ttwu() p->on_rq read and
p->sched_remote_wakeup write "sufficient"? That should be giving the right
ordering for the rest of ttwu() wrt. those 'current' bits, considering they
are written before that smp_mb__after_spinlock().

In any case, consider me convinced:

Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>

> +        unsigned            sched_remote_wakeup:1;
> +
>          /* Bit to tell LSMs we're in execve(): */
>          unsigned            in_execve:1;
>          unsigned            in_iowait:1;
On 17/11/20 12:52, Valentin Schneider wrote:
> On 17/11/20 09:46, Peter Zijlstra wrote:
>> How's this then? It still doesn't explicitly call out the specific race,
>> but does mention the more fundamental issue that wakelist queueing
>> doesn't respect the regular rules anymore.
>>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -775,7 +775,6 @@ struct task_struct {
>>          unsigned            sched_reset_on_fork:1;
>>          unsigned            sched_contributes_to_load:1;
>>          unsigned            sched_migrated:1;
>> -        unsigned            sched_remote_wakeup:1;
>>  #ifdef CONFIG_PSI
>>          unsigned            sched_psi_wake_requeue:1;
>>  #endif
>> @@ -785,6 +784,21 @@ struct task_struct {
>>
>>          /* Unserialized, strictly 'current' */
>>
>> +        /*
>> +         * This field must not be in the scheduler word above due to wakelist
>> +         * queueing no longer being serialized by p->on_cpu. However:
>> +         *
>> +         * p->XXX = X;                        ttwu()
>> +         * schedule()                           if (p->on_rq && ..) // false
>> +         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
>> +         *   deactivate_task()                      ttwu_queue_wakelist())
>> +         *     p->on_rq = 0;                      p->sched_remote_wakeup = Y;
>> +         *
>> +         * guarantees all stores of 'current' are visible before
>> +         * ->sched_remote_wakeup gets used, so it can be in this word.
>> +         */
>
> Isn't the control dep between that ttwu() p->on_rq read and
> p->sched_remote_wakeup write "sufficient"?

smp_acquire__after_ctrl_dep() that is, since we need
->on_rq load => 'current' bits load + store

> That should be giving the right
> ordering for the rest of ttwu() wrt. those 'current' bits, considering they
> are written before that smp_mb__after_spinlock().
>
> In any case, consider me convinced:
>
> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
>
>> +        unsigned            sched_remote_wakeup:1;
>> +
>>          /* Bit to tell LSMs we're in execve(): */
>>          unsigned            in_execve:1;
>>          unsigned            in_iowait:1;
On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote:
> >> +        /*
> >> +         * This field must not be in the scheduler word above due to wakelist
> >> +         * queueing no longer being serialized by p->on_cpu. However:
> >> +         *
> >> +         * p->XXX = X;                        ttwu()
> >> +         * schedule()                           if (p->on_rq && ..) // false
> >> +         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
> >> +         *   deactivate_task()                      ttwu_queue_wakelist())
> >> +         *     p->on_rq = 0;                      p->sched_remote_wakeup = Y;
> >> +         *
> >> +         * guarantees all stores of 'current' are visible before
> >> +         * ->sched_remote_wakeup gets used, so it can be in this word.
> >> +         */
> >
> > Isn't the control dep between that ttwu() p->on_rq read and
> > p->sched_remote_wakeup write "sufficient"?
>
> smp_acquire__after_ctrl_dep() that is, since we need
> ->on_rq load => 'current' bits load + store

I don't think we need that extra barrier; after all, there will be a
complete schedule() between waking the task and it actually becoming
current.
On 17/11/20 16:13, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote:
>
>> >> +        /*
>> >> +         * This field must not be in the scheduler word above due to wakelist
>> >> +         * queueing no longer being serialized by p->on_cpu. However:
>> >> +         *
>> >> +         * p->XXX = X;                        ttwu()
>> >> +         * schedule()                           if (p->on_rq && ..) // false
>> >> +         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
>> >> +         *   deactivate_task()                      ttwu_queue_wakelist())
>> >> +         *     p->on_rq = 0;                      p->sched_remote_wakeup = Y;
>> >> +         *
>> >> +         * guarantees all stores of 'current' are visible before
>> >> +         * ->sched_remote_wakeup gets used, so it can be in this word.
>> >> +         */
>> >
>> > Isn't the control dep between that ttwu() p->on_rq read and
>> > p->sched_remote_wakeup write "sufficient"?
>>
>> smp_acquire__after_ctrl_dep() that is, since we need
>> ->on_rq load => 'current' bits load + store
>
> I don't think we need that extra barrier; after all, there will be a
> complete schedule() between waking the task and it actually becoming
> current.

Apologies for the messy train of thought; what I was trying to say is that
we have already the following, which AIUI is sufficient:

 * p->XXX = X;                        ttwu()
 * schedule()                           if (p->on_rq && ..) // false
 *   smp_mb__after_spinlock();          smp_acquire__after_ctrl_dep();
 *   deactivate_task()                  ttwu_queue_wakelist()
 *     p->on_rq = 0;                      p->sched_remote_wakeup = Y;
On Tue, Nov 17, 2020 at 07:32:16PM +0000, Valentin Schneider wrote:
>
> On 17/11/20 16:13, Peter Zijlstra wrote:
> > On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote:
> >
> >> >> +        /*
> >> >> +         * This field must not be in the scheduler word above due to wakelist
> >> >> +         * queueing no longer being serialized by p->on_cpu. However:
> >> >> +         *
> >> >> +         * p->XXX = X;                        ttwu()
> >> >> +         * schedule()                           if (p->on_rq && ..) // false
> >> >> +         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
> >> >> +         *   deactivate_task()                      ttwu_queue_wakelist())
> >> >> +         *     p->on_rq = 0;                      p->sched_remote_wakeup = Y;
> >> >> +         *
> >> >> +         * guarantees all stores of 'current' are visible before
> >> >> +         * ->sched_remote_wakeup gets used, so it can be in this word.
> >> >> +         */
> >> >
> >> > Isn't the control dep between that ttwu() p->on_rq read and
> >> > p->sched_remote_wakeup write "sufficient"?
> >>
> >> smp_acquire__after_ctrl_dep() that is, since we need
> >> ->on_rq load => 'current' bits load + store
> >
> > I don't think we need that extra barrier; after all, there will be a
> > complete schedule() between waking the task and it actually becoming
> > current.
>
> Apologies for the messy train of thought; what I was trying to say is that
> we have already the following, which AIUI is sufficient:
>
>  * p->XXX = X;                        ttwu()
>  * schedule()                           if (p->on_rq && ..) // false
>  *   smp_mb__after_spinlock();          smp_acquire__after_ctrl_dep();
>  *   deactivate_task()                  ttwu_queue_wakelist()
>  *     p->on_rq = 0;                      p->sched_remote_wakeup = Y;
>

Ah, you meant the existing smp_acquire__after_ctrl_dep(). Yeah, that's
not required here either ;-)

The reason I had the ->on_cpu thing in there is because it shows we
violate the regular ->on_cpu handoff rules, not for the acquire.

The only ordering that matters on the RHS of that thing is the ->on_rq
load to p->sched_remote_wakeup store ctrl dep. That, combined with the
LHS, guarantees there is a strict order on the stores.

Makes sense?
On 18/11/20 08:05, Peter Zijlstra wrote:
> On Tue, Nov 17, 2020 at 07:32:16PM +0000, Valentin Schneider wrote:
>>
>> On 17/11/20 16:13, Peter Zijlstra wrote:
>> > On Tue, Nov 17, 2020 at 03:37:24PM +0000, Valentin Schneider wrote:
>> >
>> >> >> +        /*
>> >> >> +         * This field must not be in the scheduler word above due to wakelist
>> >> >> +         * queueing no longer being serialized by p->on_cpu. However:
>> >> >> +         *
>> >> >> +         * p->XXX = X;                        ttwu()
>> >> >> +         * schedule()                           if (p->on_rq && ..) // false
>> >> >> +         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
>> >> >> +         *   deactivate_task()                      ttwu_queue_wakelist())
>> >> >> +         *     p->on_rq = 0;                      p->sched_remote_wakeup = Y;
>> >> >> +         *
>> >> >> +         * guarantees all stores of 'current' are visible before
>> >> >> +         * ->sched_remote_wakeup gets used, so it can be in this word.
>> >> >> +         */
>> >> >
>> >> > Isn't the control dep between that ttwu() p->on_rq read and
>> >> > p->sched_remote_wakeup write "sufficient"?
>> >>
>> >> smp_acquire__after_ctrl_dep() that is, since we need
>> >> ->on_rq load => 'current' bits load + store
>> >
>> > I don't think we need that extra barrier; after all, there will be a
>> > complete schedule() between waking the task and it actually becoming
>> > current.
>>
>> Apologies for the messy train of thought; what I was trying to say is that
>> we have already the following, which AIUI is sufficient:
>>
>>  * p->XXX = X;                        ttwu()
>>  * schedule()                           if (p->on_rq && ..) // false
>>  *   smp_mb__after_spinlock();          smp_acquire__after_ctrl_dep();
>>  *   deactivate_task()                  ttwu_queue_wakelist()
>>  *     p->on_rq = 0;                      p->sched_remote_wakeup = Y;
>>
>
> Ah, you meant the existing smp_acquire__after_ctrl_dep(). Yeah, that's
> not required here either ;-)
>
> The reason I had the ->on_cpu thing in there is because it shows we
> violate the regular ->on_cpu handoff rules, not for the acquire.
>

Gotcha

> The only ordering that matters on the RHS of that thing is the ->on_rq
> load to p->sched_remote_wakeup store ctrl dep. That, combined with the
> LHS, guarantees there is a strict order on the stores.
>
> Makes sense?

Yep, thanks!
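[The ordering argument spelled out above can be checked mechanically against the kernel's memory model. Below is a litmus-test sketch for herd7 with tools/memory-model: 'flags' stands in for the shared bitfield word, and the wakeup-side store is deliberately not conditioned on the on_cpu value, mirroring the point that the ->on_rq control dependency plus the smp_mb__after_spinlock() on the schedule side is what orders the two stores. Assuming the accesses are modelled faithfully, the exists clause (the wakeup side saw on_rq == 0, yet its store to the word was overwritten by the 'current'-only store) should be reported as never reachable; drop the smp_mb() and it becomes reachable.]

C sched-remote-wakeup-ctrl-dep

(*
 * Sketch only: 'flags' models the word holding the 'current'-only bits
 * and sched_remote_wakeup; variable names are illustrative.
 *)

{}

P0(int *flags, int *on_rq, int *on_cpu)
{
	WRITE_ONCE(*flags, 1);		/* 'current'-only store, e.g. p->in_iowait */
	smp_mb();			/* smp_mb__after_spinlock() */
	WRITE_ONCE(*on_rq, 0);		/* deactivate_task() */
	smp_store_release(on_cpu, 0);
}

P1(int *flags, int *on_rq, int *on_cpu)
{
	int r0;
	int r1;

	r0 = READ_ONCE(*on_rq);			/* if (p->on_rq && ...) -- sees 0 */
	if (r0 == 0) {
		r1 = smp_load_acquire(on_cpu);	/* may still observe 1 */
		WRITE_ONCE(*flags, 2);		/* p->sched_remote_wakeup = Y */
	}
}

exists (1:r0=0 /\ flags=1)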
On Tue, Nov 17, 2020 at 10:29AM +0100, Peter Zijlstra wrote:
[...]
> > Now the million dollar question is why KCSAN hasn't run into this. Hrmph.
>
> kernel/sched/Makefile:KCSAN_SANITIZE := n
>
> might have something to do with that, I suppose.

For the record, I tried to reproduce this data race. I found a read/write
race on this bitfield, but not yet that write/write race (perhaps I wasn't
running the right workload).

| read to 0xffff8d4e2ce39aac of 1 bytes by task 5269 on cpu 3:
|  __sched_setscheduler+0x4a9/0x1070 kernel/sched/core.c:5297
|  sched_setattr kernel/sched/core.c:5512 [inline]
|  ...
|
| write to 0xffff8d4e2ce39aac of 1 bytes by task 5268 on cpu 1:
|  __schedule+0x296/0xab0 kernel/sched/core.c:4462    prev->sched_contributes_to_load =
|  schedule+0xd1/0x130 kernel/sched/core.c:4601
|  ...
|
| Full report: https://paste.debian.net/hidden/07a50732/

Getting to the above race also required some effort, as 1) I kept hitting
other unrelated data races in the scheduler and had to silence those first
to be able to make progress, and 2) I had to enable KCSAN only for
scheduler code, to ignore all other data races. Then I let syzkaller run
for a few minutes.

Also note our default KCSAN config is suboptimal. For serious debugging,
I'd recommend the same config that rcutorture uses with the --kcsan flag,
specifically CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=n and
CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC=n, to get the full picture.

However, as a first step, it'd be nice to eventually remove the
KCSAN_SANITIZE := n from kernel/sched/Makefile when things are less noisy
(so that syzbot and default builds can start finding more serious issues,
too).

Thanks,
-- Marco
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,6 @@ struct task_struct {
         unsigned            sched_reset_on_fork:1;
         unsigned            sched_contributes_to_load:1;
         unsigned            sched_migrated:1;
-        unsigned            sched_remote_wakeup:1;
 #ifdef CONFIG_PSI
         unsigned            sched_psi_wake_requeue:1;
 #endif
@@ -785,6 +784,18 @@ struct task_struct {

         /* Unserialized, strictly 'current' */

+        /*
+         * p->in_iowait = 1;                  ttwu()
+         * schedule()                           if (p->on_rq && ..) // false
+         *   smp_mb__after_spinlock();          if (smp_load_acquire(&p->on_cpu) && //true
+         *   deactivate_task()                      ttwu_queue_wakelist())
+         *     p->on_rq = 0;                      p->sched_remote_wakeup = X;
+         *
+         * Guarantees all stores of 'current' are visible before
+         * ->sched_remote_wakeup gets used.
+         */
+        unsigned            sched_remote_wakeup:1;
+
         /* Bit to tell LSMs we're in execve(): */
         unsigned            in_execve:1;
         unsigned            in_iowait:1;