[PATCHv2,1/3] rcu: Keep qsmaskinitnext fresh when rcutree_online_cpu()

Message ID 20220915055825.21525-2-kernelfans@gmail.com (mailing list archive)
State New, archived
Series rcu: Enhance the capability to cope with concurrent cpu offlining/onlining

Commit Message

Pingfan Liu Sept. 15, 2022, 5:58 a.m. UTC
rcutree_online_cpu() can run concurrently, so the following race
scenario is possible:

        CPU 1                               CPU 2
mask_old = rcu_rnp_online_cpus(rnp);
...

                                   mask_new = rcu_rnp_online_cpus(rnp);
                                   ...
                                   set_cpus_allowed_ptr(t, cm);

set_cpus_allowed_ptr(t, cm);

Consequently, the old mask will overwrite the new mask in the task's cpus_ptr.

Since there is already a mutex, ->boost_kthread_mutex, use it to build an
ordering so that the latest ->qsmaskinitnext is fetched for updating cpus_ptr.

But for concurrent offlining, ->qsmaskinitnext is not reliable at the point
of rcutree_offline_cpu(). That is another story, addressed by the following
patch.

Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
To: rcu@vger.kernel.org
---
 kernel/rcu/tree_plugin.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Pingfan Liu Sept. 15, 2022, 6:11 a.m. UTC | #1
Hi Paul,

If you picked up this one from my V1, could you drop it? I have
rephrased the commit log and the comment in the code. Sorry for the
inconvenience.

On Thu, Sep 15, 2022 at 1:58 PM Pingfan Liu <kernelfans@gmail.com> wrote:
>
> rcutree_online_cpu() can run concurrently, so the following race
> scenario is possible:
>
>         CPU 1                               CPU 2
> mask_old = rcu_rnp_online_cpus(rnp);
> ...
>
>                                    mask_new = rcu_rnp_online_cpus(rnp);
>                                    ...
>                                    set_cpus_allowed_ptr(t, cm);
>
> set_cpus_allowed_ptr(t, cm);
>
> Consequently, the old mask will overwrite the new mask in the task's cpus_ptr.
>
> Since there is already a mutex, ->boost_kthread_mutex, use it to build an
> ordering so that the latest ->qsmaskinitnext is fetched for updating cpus_ptr.
>
> But for concurrent offlining, ->qsmaskinitnext is not reliable at the point
> of rcutree_offline_cpu(). That is another story, addressed by the following
> patch.
>
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
> To: rcu@vger.kernel.org
> ---
>  kernel/rcu/tree_plugin.h | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 438ecae6bd7e..ef6d3ae239b9 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1224,7 +1224,7 @@ static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
>  static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>  {
>         struct task_struct *t = rnp->boost_kthread_task;
> -       unsigned long mask = rcu_rnp_online_cpus(rnp);
> +       unsigned long mask;
>         cpumask_var_t cm;
>         int cpu;
>
> @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>         if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
>                 return;
>         mutex_lock(&rnp->boost_kthread_mutex);
> +       /*
> +        * Relying on the lock to serialize, so when onlining, the latest
> +        * qsmaskinitnext is for cpus_ptr.
> +        */

In the comment, I put the accent on 'onlining', which is different from V1.

Thanks,

Pingfan

> +       mask = rcu_rnp_online_cpus(rnp);
>         for_each_leaf_node_possible_cpu(rnp, cpu)
>                 if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
>                     cpu != outgoingcpu)
> --
> 2.31.1
>
Frederic Weisbecker Sept. 16, 2022, 2:52 p.m. UTC | #2
On Thu, Sep 15, 2022 at 01:58:23PM +0800, Pingfan Liu wrote:
> rcutree_online_cpu() can run concurrently, so the following race
> scenario is possible:
> 
>         CPU 1                               CPU 2
> mask_old = rcu_rnp_online_cpus(rnp);
> ...
> 
>                                    mask_new = rcu_rnp_online_cpus(rnp);
>                                    ...
>                                    set_cpus_allowed_ptr(t, cm);
> 
> set_cpus_allowed_ptr(t, cm);
> 
> Consequently, the old mask will overwrite the new mask in the task's cpus_ptr.
> 
> Since there is already a mutex, ->boost_kthread_mutex, use it to build an
> ordering so that the latest ->qsmaskinitnext is fetched for updating cpus_ptr.
>
> But for concurrent offlining, ->qsmaskinitnext is not reliable at the point
> of rcutree_offline_cpu(). That is another story, addressed by the following
> patch.
> 
> Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> Cc: "Paul E. McKenney" <paulmck@kernel.org>
> Cc: David Woodhouse <dwmw@amazon.co.uk>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
> Cc: Josh Triplett <josh@joshtriplett.org>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> Cc: Joel Fernandes <joel@joelfernandes.org>
> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
> To: rcu@vger.kernel.org
> ---
>  kernel/rcu/tree_plugin.h | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 438ecae6bd7e..ef6d3ae239b9 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -1224,7 +1224,7 @@ static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
>  static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>  {
>  	struct task_struct *t = rnp->boost_kthread_task;
> -	unsigned long mask = rcu_rnp_online_cpus(rnp);
> +	unsigned long mask;
>  	cpumask_var_t cm;
>  	int cpu;
>  
> @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>  	if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
>  		return;
>  	mutex_lock(&rnp->boost_kthread_mutex);
> +	/*
> +	 * Relying on the lock to serialize, so when onlining, the latest
> +	 * qsmaskinitnext is for cpus_ptr.
> +	 */
> +	mask = rcu_rnp_online_cpus(rnp);
>  	for_each_leaf_node_possible_cpu(rnp, cpu)
>  		if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
>  		    cpu != outgoingcpu)

Right but you still race against concurrent rcu_report_dead() doing:

      WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask)
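
For illustration, one possible interleaving (a sketch; rcu_report_dead()
clears the bit under the rnp lock, not under ->boost_kthread_mutex, so the
mutex alone does not order the two):

          CPU 0                                    CPU 1
          -----                                    -----
       rcu_boost_kthread_setaffinity()
           mutex_lock(&rnp->boost_kthread_mutex);
           mask = rcu_rnp_online_cpus(rnp);
                                                rcu_report_dead()
                                                    WRITE_ONCE(rnp->qsmaskinitnext,
                                                               rnp->qsmaskinitnext & ~mask);
           set_cpus_allowed_ptr(t, cm);  /* mask is stale */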

Thanks.

> -- 
> 2.31.1
>
Pingfan Liu Sept. 19, 2022, 10:24 a.m. UTC | #3
On Fri, Sep 16, 2022 at 10:52 PM Frederic Weisbecker
<frederic@kernel.org> wrote:
>
> On Thu, Sep 15, 2022 at 01:58:23PM +0800, Pingfan Liu wrote:
> > rcutree_online_cpu() can run concurrently, so the following race
> > scenario is possible:
> >
> >         CPU 1                               CPU 2
> > mask_old = rcu_rnp_online_cpus(rnp);
> > ...
> >
> >                                    mask_new = rcu_rnp_online_cpus(rnp);
> >                                    ...
> >                                    set_cpus_allowed_ptr(t, cm);
> >
> > set_cpus_allowed_ptr(t, cm);
> >
> > Consequently, the old mask will overwrite the new mask in the task's cpus_ptr.
> >
> > Since there is already a mutex, ->boost_kthread_mutex, use it to build an
> > ordering so that the latest ->qsmaskinitnext is fetched for updating cpus_ptr.
> >
> > But for concurrent offlining, ->qsmaskinitnext is not reliable at the point
> > of rcutree_offline_cpu(). That is another story, addressed by the following
> > patch.
> >
> > Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
> > Cc: "Paul E. McKenney" <paulmck@kernel.org>
> > Cc: David Woodhouse <dwmw@amazon.co.uk>
> > Cc: Frederic Weisbecker <frederic@kernel.org>
> > Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
> > Cc: Josh Triplett <josh@joshtriplett.org>
> > Cc: Steven Rostedt <rostedt@goodmis.org>
> > Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > Cc: Lai Jiangshan <jiangshanlai@gmail.com>
> > Cc: Joel Fernandes <joel@joelfernandes.org>
> > Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
> > To: rcu@vger.kernel.org
> > ---
> >  kernel/rcu/tree_plugin.h | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > index 438ecae6bd7e..ef6d3ae239b9 100644
> > --- a/kernel/rcu/tree_plugin.h
> > +++ b/kernel/rcu/tree_plugin.h
> > @@ -1224,7 +1224,7 @@ static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
> >  static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
> >  {
> >       struct task_struct *t = rnp->boost_kthread_task;
> > -     unsigned long mask = rcu_rnp_online_cpus(rnp);
> > +     unsigned long mask;
> >       cpumask_var_t cm;
> >       int cpu;
> >
> > @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
> >       if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
> >               return;
> >       mutex_lock(&rnp->boost_kthread_mutex);
> > +     /*
> > +      * Relying on the lock to serialize, so when onlining, the latest
> > +      * qsmaskinitnext is for cpus_ptr.
> > +      */
> > +     mask = rcu_rnp_online_cpus(rnp);
> >       for_each_leaf_node_possible_cpu(rnp, cpu)
> >               if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
> >                   cpu != outgoingcpu)
>
> Right but you still race against concurrent rcu_report_dead() doing:
>
>       WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask)
>

Ah. Indeed, my commit log does not describe this precisely.

In fact, both the smp_init speedup [1] and the fast kexec reboot [2]
still use the model of one hotplug initiator and several reactors. The
initiators are still excluded from each other by the pair
cpu_maps_update_begin()/cpu_maps_update_done(), so there cannot be a
cpu-up and a cpu-down event at the same time.
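
As a sketch of that exclusion (simplified; these helpers live in
kernel/cpu.c and serialize hotplug initiators on cpu_add_remove_lock):

	cpu_maps_update_begin();   /* mutex_lock(&cpu_add_remove_lock) */
	/* ... bring CPUs up, or bring CPUs down, possibly many in a batch ... */
	cpu_maps_update_done();    /* mutex_unlock(&cpu_add_remove_lock) */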

Thanks,

    Pingfan
Frederic Weisbecker Sept. 19, 2022, 10:51 a.m. UTC | #4
On Mon, Sep 19, 2022 at 06:24:21PM +0800, Pingfan Liu wrote:
> On Fri, Sep 16, 2022 at 10:52 PM Frederic Weisbecker
> <frederic@kernel.org> wrote:
> > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > index 438ecae6bd7e..ef6d3ae239b9 100644
> > > --- a/kernel/rcu/tree_plugin.h
> > > +++ b/kernel/rcu/tree_plugin.h
> > > @@ -1224,7 +1224,7 @@ static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
> > >  static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
> > >  {
> > >       struct task_struct *t = rnp->boost_kthread_task;
> > > -     unsigned long mask = rcu_rnp_online_cpus(rnp);
> > > +     unsigned long mask;
> > >       cpumask_var_t cm;
> > >       int cpu;
> > >
> > > @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
> > >       if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
> > >               return;
> > >       mutex_lock(&rnp->boost_kthread_mutex);
> > > +     /*
> > > +      * Relying on the lock to serialize, so when onlining, the latest
> > > +      * qsmaskinitnext is for cpus_ptr.
> > > +      */
> > > +     mask = rcu_rnp_online_cpus(rnp);
> > >       for_each_leaf_node_possible_cpu(rnp, cpu)
> > >               if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
> > >                   cpu != outgoingcpu)
> >
> > Right but you still race against concurrent rcu_report_dead() doing:
> >
> >       WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask)
> >
> 
> Ah. Indeed, my commit log does not describe this precisely.
>
> In fact, both the smp_init speedup [1] and the fast kexec reboot [2]
> still use the model of one hotplug initiator and several reactors. The
> initiators are still excluded from each other by the pair
> cpu_maps_update_begin()/cpu_maps_update_done(), so there cannot be a
> cpu-up and a cpu-down event at the same time.

Yes but two downing CPUs may race right?

So isn't it possible to have rcu_report_dead() racing against
rcutree_offline_cpu()?

Thanks.
Pingfan Liu Sept. 20, 2022, 3:45 a.m. UTC | #5
On Mon, Sep 19, 2022 at 12:51:49PM +0200, Frederic Weisbecker wrote:
> On Mon, Sep 19, 2022 at 06:24:21PM +0800, Pingfan Liu wrote:
> > On Fri, Sep 16, 2022 at 10:52 PM Frederic Weisbecker
> > <frederic@kernel.org> wrote:
> > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > index 438ecae6bd7e..ef6d3ae239b9 100644
> > > > --- a/kernel/rcu/tree_plugin.h
> > > > +++ b/kernel/rcu/tree_plugin.h
> > > > @@ -1224,7 +1224,7 @@ static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
> > > >  static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
> > > >  {
> > > >       struct task_struct *t = rnp->boost_kthread_task;
> > > > -     unsigned long mask = rcu_rnp_online_cpus(rnp);
> > > > +     unsigned long mask;
> > > >       cpumask_var_t cm;
> > > >       int cpu;
> > > >
> > > > @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
> > > >       if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
> > > >               return;
> > > >       mutex_lock(&rnp->boost_kthread_mutex);
> > > > +     /*
> > > > +      * Relying on the lock to serialize, so when onlining, the latest
> > > > +      * qsmaskinitnext is for cpus_ptr.
> > > > +      */
> > > > +     mask = rcu_rnp_online_cpus(rnp);
> > > >       for_each_leaf_node_possible_cpu(rnp, cpu)
> > > >               if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
> > > >                   cpu != outgoingcpu)
> > >
> > > Right but you still race against concurrent rcu_report_dead() doing:
> > >
> > >       WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask)
> > >
> > 
> > Ah. Indeed, my commit log does not describe this precisely.
> >
> > In fact, both the smp_init speedup [1] and the fast kexec reboot [2]
> > still use the model of one hotplug initiator and several reactors. The
> > initiators are still excluded from each other by the pair
> > cpu_maps_update_begin()/cpu_maps_update_done(), so there cannot be a
> > cpu-up and a cpu-down event at the same time.
> 
> Yes but two downing CPUs may race right?
> 
> So isn't it possible to have rcu_report_dead() racing against
> rcutree_offline_cpu()?
> 

Yes, two downing CPUs can be in the same batch.

There is a nuance between onlining and offlining:
-1. onlining relies on qsmaskinitnext, which is stable at the point of
rcutree_online_cpu(). That is what this patch aims for.
-2. offlining relies on cpu_dying_mask [2/3] instead of qsmaskinitnext,
which is not yet cleared at the point of rcutree_offline_cpu().

It is asymmetric owing to the point at which qsmaskinitnext is updated.

Now consider the race between rcu_report_dead() and
rcutree_offline_cpu():

No matter whether rcutree_offline_cpu() reads out a stale or a fresh
value of qsmaskinitnext, that value should be intersected with
cpu_dying_mask.
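
(As a rough sketch of that offline-side filtering — patch 2/3 is not
shown here, so the exact form below is an assumption based on the
description above:)

	/* Hypothetical sketch only; the real check lands in patch 2/3. */
	for_each_leaf_node_possible_cpu(rnp, cpu)
		if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
		    !cpumask_test_cpu(cpu, cpu_dying_mask) &&
		    cpu != outgoingcpu)
			cpumask_set_cpu(cpu, cm);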

Thanks,

	Pingfan
Frederic Weisbecker Sept. 20, 2022, 9:23 a.m. UTC | #6
On Tue, Sep 20, 2022 at 11:45:45AM +0800, Pingfan Liu wrote:
> On Mon, Sep 19, 2022 at 12:51:49PM +0200, Frederic Weisbecker wrote:
> > On Mon, Sep 19, 2022 at 06:24:21PM +0800, Pingfan Liu wrote:
> > > On Fri, Sep 16, 2022 at 10:52 PM Frederic Weisbecker
> > > <frederic@kernel.org> wrote:
> > > > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> > > > > index 438ecae6bd7e..ef6d3ae239b9 100644
> > > > > --- a/kernel/rcu/tree_plugin.h
> > > > > +++ b/kernel/rcu/tree_plugin.h
> > > > > @@ -1224,7 +1224,7 @@ static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
> > > > >  static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
> > > > >  {
> > > > >       struct task_struct *t = rnp->boost_kthread_task;
> > > > > -     unsigned long mask = rcu_rnp_online_cpus(rnp);
> > > > > +     unsigned long mask;
> > > > >       cpumask_var_t cm;
> > > > >       int cpu;
> > > > >
> > > > > @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
> > > > >       if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
> > > > >               return;
> > > > >       mutex_lock(&rnp->boost_kthread_mutex);
> > > > > +     /*
> > > > > +      * Relying on the lock to serialize, so when onlining, the latest
> > > > > +      * qsmaskinitnext is for cpus_ptr.
> > > > > +      */
> > > > > +     mask = rcu_rnp_online_cpus(rnp);
> > > > >       for_each_leaf_node_possible_cpu(rnp, cpu)
> > > > >               if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
> > > > >                   cpu != outgoingcpu)
> > > >
> > > > Right but you still race against concurrent rcu_report_dead() doing:
> > > >
> > > >       WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask)
> > > >
> > > 
> > > Ah. Indeed, my commit log does not describe this precisely.
> > >
> > > In fact, both the smp_init speedup [1] and the fast kexec reboot [2]
> > > still use the model of one hotplug initiator and several reactors. The
> > > initiators are still excluded from each other by the pair
> > > cpu_maps_update_begin()/cpu_maps_update_done(), so there cannot be a
> > > cpu-up and a cpu-down event at the same time.
> > 
> > Yes but two downing CPUs may race right?
> > 
> > So isn't it possible to have rcu_report_dead() racing against
> > rcutree_offline_cpu()?
> > 
> 
> Yes, two downing CPUs can be in the same batch.
>
> There is a nuance between onlining and offlining:
> -1. onlining relies on qsmaskinitnext, which is stable at the point of
> rcutree_online_cpu(). That is what this patch aims for.
> -2. offlining relies on cpu_dying_mask [2/3] instead of qsmaskinitnext,
> which is not yet cleared at the point of rcutree_offline_cpu().
>
> It is asymmetric owing to the point at which qsmaskinitnext is updated.
>
> Now consider the race between rcu_report_dead() and
> rcutree_offline_cpu():
>
> No matter whether rcutree_offline_cpu() reads out a stale or a fresh
> value of qsmaskinitnext, that value should be intersected with
> cpu_dying_mask.

Ah I see now, so the race can happen but it is then cancelled by the
stable READ on cpu_dying_mask. Ok.

> 
> Thanks,
> 
> 	Pingfan
Frederic Weisbecker Sept. 20, 2022, 10:31 a.m. UTC | #7
On Thu, Sep 15, 2022 at 01:58:23PM +0800, Pingfan Liu wrote:
> rcutree_online_cpu() can run concurrently, so the following race
> scenario is possible:
> 
>         CPU 1                               CPU 2
> mask_old = rcu_rnp_online_cpus(rnp);
> ...
> 
>                                    mask_new = rcu_rnp_online_cpus(rnp);
>                                    ...
>                                    set_cpus_allowed_ptr(t, cm);
> 
> set_cpus_allowed_ptr(t, cm);

Just for clarity, may I suggest:


          CPU 0                                    CPU 1
          -----                                    -----
       rcutree_online_cpu()
           rcu_boost_kthread_setaffinity()
               mask = rcu_rnp_online_cpus(rnp);
                                                rcu_cpu_starting()
                                                    WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
                                                rcutree_online_cpu()
                                                    rcu_boost_kthread_setaffinity()
                                                        mask = rcu_rnp_online_cpus(rnp);
                                                        set_cpus_allowed_ptr(t, cm);
               set_cpus_allowed_ptr(t, cm);


> @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>  	if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
>  		return;
>  	mutex_lock(&rnp->boost_kthread_mutex);
> +	/*
> +	 * Relying on the lock to serialize, so when onlining, the latest
> +	 * qsmaskinitnext is for cpus_ptr.
> +	 */

Given how tricky the situation is, I suggest also adding more details to
the comment:

        /*
         * Rely on the boost mutex to serialize ->qsmaskinitnext and
         * set_cpus_allowed_ptr(), so when concurrently onlining, racing against
	 * rcu_cpu_starting() makes sure that the latest update eventually
	 * wins with the right affinity setting.
         */

Thanks!

> +	mask = rcu_rnp_online_cpus(rnp);
>  	for_each_leaf_node_possible_cpu(rnp, cpu)
>  		if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
>  		    cpu != outgoingcpu)
> -- 
> 2.31.1
>
Pingfan Liu Sept. 21, 2022, 11:56 a.m. UTC | #8
On Tue, Sep 20, 2022 at 12:31:43PM +0200, Frederic Weisbecker wrote:
> On Thu, Sep 15, 2022 at 01:58:23PM +0800, Pingfan Liu wrote:
> > rcutree_online_cpu() can run concurrently, so the following race
> > scenario is possible:
> > 
> >         CPU 1                               CPU 2
> > mask_old = rcu_rnp_online_cpus(rnp);
> > ...
> > 
> >                                    mask_new = rcu_rnp_online_cpus(rnp);
> >                                    ...
> >                                    set_cpus_allowed_ptr(t, cm);
> > 
> > set_cpus_allowed_ptr(t, cm);
> 
> Just for clarity, may I suggest:
> 
> 
>           CPU 0                                    CPU 1
>           -----                                    -----
>        rcutree_online_cpu()
>            rcu_boost_kthread_setaffinity()
>                mask = rcu_rnp_online_cpus(rnp);
>                                                 rcu_cpu_starting()
>                                                     WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
>                                                 rcutree_online_cpu()
>                                                     rcu_boost_kthread_setaffinity()
>                                                         mask = rcu_rnp_online_cpus(rnp);
>                                                         set_cpus_allowed_ptr(t, cm);
>                set_cpus_allowed_ptr(t, cm);
> 
> 

OK.

> > @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
> >  	if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
> >  		return;
> >  	mutex_lock(&rnp->boost_kthread_mutex);
> > +	/*
> > +	 * Relying on the lock to serialize, so when onlining, the latest
> > +	 * qsmaskinitnext is for cpus_ptr.
> > +	 */
> 
> Given how tricky the situation is, I suggest also adding more details to
> the comment:
> 
>         /*
>          * Rely on the boost mutex to serialize ->qsmaskinitnext and
>          * set_cpus_allowed_ptr(), so when concurrently onlining, racing against
> 	 * rcu_cpu_starting() makes sure that the latest update eventually
> 	 * wins with the right affinity setting.
>          */
> 

Thanks for the rephrasing. I will use it in the next version.


Thanks,
	Pingfan
Joel Fernandes Oct. 1, 2022, 2:26 a.m. UTC | #9
On 9/20/22 05:23, Frederic Weisbecker wrote:
> On Tue, Sep 20, 2022 at 11:45:45AM +0800, Pingfan Liu wrote:
>> On Mon, Sep 19, 2022 at 12:51:49PM +0200, Frederic Weisbecker wrote:
>>> On Mon, Sep 19, 2022 at 06:24:21PM +0800, Pingfan Liu wrote:
>>>> On Fri, Sep 16, 2022 at 10:52 PM Frederic Weisbecker
>>>> <frederic@kernel.org> wrote:
>>>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>>>>> index 438ecae6bd7e..ef6d3ae239b9 100644
>>>>>> --- a/kernel/rcu/tree_plugin.h
>>>>>> +++ b/kernel/rcu/tree_plugin.h
>>>>>> @@ -1224,7 +1224,7 @@ static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
>>>>>>  static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>>>>>>  {
>>>>>>       struct task_struct *t = rnp->boost_kthread_task;
>>>>>> -     unsigned long mask = rcu_rnp_online_cpus(rnp);
>>>>>> +     unsigned long mask;
>>>>>>       cpumask_var_t cm;
>>>>>>       int cpu;
>>>>>>
>>>>>> @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>>>>>>       if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
>>>>>>               return;
>>>>>>       mutex_lock(&rnp->boost_kthread_mutex);
>>>>>> +     /*
>>>>>> +      * Relying on the lock to serialize, so when onlining, the latest
>>>>>> +      * qsmaskinitnext is for cpus_ptr.
>>>>>> +      */
>>>>>> +     mask = rcu_rnp_online_cpus(rnp);
>>>>>>       for_each_leaf_node_possible_cpu(rnp, cpu)
>>>>>>               if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
>>>>>>                   cpu != outgoingcpu)
>>>>>
>>>>> Right but you still race against concurrent rcu_report_dead() doing:
>>>>>
>>>>>       WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask)
>>>>>
>>>>
>>>> Ah. Indeed, my commit log does not describe this precisely.
>>>>
>>>> In fact, both the smp_init speedup [1] and the fast kexec reboot [2]
>>>> still use the model of one hotplug initiator and several reactors. The
>>>> initiators are still excluded from each other by the pair
>>>> cpu_maps_update_begin()/cpu_maps_update_done(), so there cannot be a
>>>> cpu-up and a cpu-down event at the same time.
>>>
>>> Yes but two downing CPUs may race right?
>>>
>>> So isn't it possible to have rcu_report_dead() racing against
>>> rcutree_offline_cpu()?
>>>
>>
>> Yes, two downing CPUs can be in the same batch.
>>
>> There is a nuance between onlining and offlining:
>> -1. onlining relies on qsmaskinitnext, which is stable at the point of
>> rcutree_online_cpu(). That is what this patch aims for.
>> -2. offlining relies on cpu_dying_mask [2/3] instead of qsmaskinitnext,
>> which is not yet cleared at the point of rcutree_offline_cpu().
>>
>> It is asymmetric owing to the point at which qsmaskinitnext is updated.
>>
>> Now consider the race between rcu_report_dead() and
>> rcutree_offline_cpu():
>>
>> No matter whether rcutree_offline_cpu() reads out a stale or a fresh
>> value of qsmaskinitnext, that value should be intersected with
>> cpu_dying_mask.
> 
> Ah I see now, so the race can happen but it is then cancelled by the
> stable READ on cpu_dying_mask. Ok.
> 

Maybe I am missing something, but concurrent rcu_report_dead() on CPUs
of the same node can corrupt ->qsmaskinitnext, as Frederic showed. That
may not be a problem for affinities (due to the cpu_dying_mask), but is
that not an issue for RCU operation itself?

I.e., due to the corruption, ->qsmaskinitnext could be in an
inconsistent state, causing *RCU* issues.

Or is there a reason why that is not an issue?

thanks,

 - Joel
Pingfan Liu Oct. 2, 2022, 12:34 p.m. UTC | #10
On Sat, Oct 1, 2022 at 10:26 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
>
>
> On 9/20/22 05:23, Frederic Weisbecker wrote:
> > On Tue, Sep 20, 2022 at 11:45:45AM +0800, Pingfan Liu wrote:
> >> On Mon, Sep 19, 2022 at 12:51:49PM +0200, Frederic Weisbecker wrote:
> >>> On Mon, Sep 19, 2022 at 06:24:21PM +0800, Pingfan Liu wrote:
> >>>> On Fri, Sep 16, 2022 at 10:52 PM Frederic Weisbecker
> >>>> <frederic@kernel.org> wrote:
> >>>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> >>>>>> index 438ecae6bd7e..ef6d3ae239b9 100644
> >>>>>> --- a/kernel/rcu/tree_plugin.h
> >>>>>> +++ b/kernel/rcu/tree_plugin.h
> >>>>>> @@ -1224,7 +1224,7 @@ static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
> >>>>>>  static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
> >>>>>>  {
> >>>>>>       struct task_struct *t = rnp->boost_kthread_task;
> >>>>>> -     unsigned long mask = rcu_rnp_online_cpus(rnp);
> >>>>>> +     unsigned long mask;
> >>>>>>       cpumask_var_t cm;
> >>>>>>       int cpu;
> >>>>>>
> >>>>>> @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
> >>>>>>       if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
> >>>>>>               return;
> >>>>>>       mutex_lock(&rnp->boost_kthread_mutex);
> >>>>>> +     /*
> >>>>>> +      * Relying on the lock to serialize, so when onlining, the latest
> >>>>>> +      * qsmaskinitnext is for cpus_ptr.
> >>>>>> +      */
> >>>>>> +     mask = rcu_rnp_online_cpus(rnp);
> >>>>>>       for_each_leaf_node_possible_cpu(rnp, cpu)
> >>>>>>               if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
> >>>>>>                   cpu != outgoingcpu)
> >>>>>
> >>>>> Right but you still race against concurrent rcu_report_dead() doing:
> >>>>>
> >>>>>       WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask)
> >>>>>
> >>>>
> >>>> Ah. Indeed, my commit log does not describe this precisely.
> >>>>
> >>>> In fact, both the smp_init speedup [1] and the fast kexec reboot [2]
> >>>> still use the model of one hotplug initiator and several reactors. The
> >>>> initiators are still excluded from each other by the pair
> >>>> cpu_maps_update_begin()/cpu_maps_update_done(), so there cannot be a
> >>>> cpu-up and a cpu-down event at the same time.
> >>>
> >>> Yes but two downing CPUs may race right?
> >>>
> >>> So isn't it possible to have rcu_report_dead() racing against
> >>> rcutree_offline_cpu()?
> >>>
> >>
> >> Yes, two downing CPUs can be in the same batch.
> >>
> >> There is a nuance between onlining and offlining:
> >> -1. onlining relies on qsmaskinitnext, which is stable at the point of
> >> rcutree_online_cpu(). That is what this patch aims for.
> >> -2. offlining relies on cpu_dying_mask [2/3] instead of qsmaskinitnext,
> >> which is not yet cleared at the point of rcutree_offline_cpu().
> >>
> >> It is asymmetric owing to the point at which qsmaskinitnext is updated.
> >>
> >> Now consider the race between rcu_report_dead() and
> >> rcutree_offline_cpu():
> >>
> >> No matter whether rcutree_offline_cpu() reads out a stale or a fresh
> >> value of qsmaskinitnext, that value should be intersected with
> >> cpu_dying_mask.
> >
> > Ah I see now, so the race can happen but it is then cancelled by the
> > stable READ on cpu_dying_mask. Ok.
> >
>
> Maybe I am missing something, but concurrent rcu_report_dead() on CPUs
> of the same node can corrupt ->qsmaskinitnext, as Frederic showed. That

There is a pair of lock ops around WRITE_ONCE(rnp->qsmaskinitnext,
rnp->qsmaskinitnext & ~mask), i.e. raw_spin_lock_irqsave_rcu_node(rnp,
flags) and raw_spin_unlock_irqrestore_rcu_node(rnp, flags).
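
(An abridged sketch of that update side, simplified from
kernel/rcu/tree.c of this era; surrounding details omitted:)

	/*
	 * Each clear of a ->qsmaskinitnext bit happens with the rnp lock
	 * held, so concurrent rcu_report_dead() calls on CPUs of the same
	 * leaf node serialize and cannot corrupt the mask.
	 */
	mask = rdp->grpmask;
	raw_spin_lock_irqsave_rcu_node(rnp, flags);
	WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask);
	raw_spin_unlock_irqrestore_rcu_node(rnp, flags);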

So it should not be a problem, right?

Thanks,

    Pingfan
Joel Fernandes Oct. 2, 2022, 3:52 p.m. UTC | #11
On 10/2/2022 8:34 AM, Pingfan Liu wrote:
> On Sat, Oct 1, 2022 at 10:26 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>> On 9/20/22 05:23, Frederic Weisbecker wrote:
>>> On Tue, Sep 20, 2022 at 11:45:45AM +0800, Pingfan Liu wrote:
[...]
>>>>>> <frederic@kernel.org> wrote:
>>>>>>>> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
>>>>>>>> index 438ecae6bd7e..ef6d3ae239b9 100644
>>>>>>>> --- a/kernel/rcu/tree_plugin.h
>>>>>>>> +++ b/kernel/rcu/tree_plugin.h
>>>>>>>> @@ -1224,7 +1224,7 @@ static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
>>>>>>>>  static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>>>>>>>>  {
>>>>>>>>       struct task_struct *t = rnp->boost_kthread_task;
>>>>>>>> -     unsigned long mask = rcu_rnp_online_cpus(rnp);
>>>>>>>> +     unsigned long mask;
>>>>>>>>       cpumask_var_t cm;
>>>>>>>>       int cpu;
>>>>>>>>
>>>>>>>> @@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
>>>>>>>>       if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
>>>>>>>>               return;
>>>>>>>>       mutex_lock(&rnp->boost_kthread_mutex);
>>>>>>>> +     /*
>>>>>>>> +      * Relying on the lock to serialize, so when onlining, the latest
>>>>>>>> +      * qsmaskinitnext is for cpus_ptr.
>>>>>>>> +      */
>>>>>>>> +     mask = rcu_rnp_online_cpus(rnp);
>>>>>>>>       for_each_leaf_node_possible_cpu(rnp, cpu)
>>>>>>>>               if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
>>>>>>>>                   cpu != outgoingcpu)
>>>>>>>
>>>>>>> Right but you still race against concurrent rcu_report_dead() doing:
>>>>>>>
>>>>>>>       WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext & ~mask)
>>>>>>>
>>>>>>
>>>>>> Ah. Indeed, my commit log does not describe this precisely.
>>>>>>
>>>>>> In fact, both the smp_init speedup [1] and the fast kexec reboot [2]
>>>>>> still use the model of one hotplug initiator and several reactors. The
>>>>>> initiators are still excluded from each other by the pair
>>>>>> cpu_maps_update_begin()/cpu_maps_update_done(), so there cannot be a
>>>>>> cpu-up and a cpu-down event at the same time.
>>>>>
>>>>> Yes but two downing CPUs may race right?
>>>>>
>>>>> So isn't it possible to have rcu_report_dead() racing against
>>>>> rcutree_offline_cpu()?
>>>>>
>>>>
>>>> Yes, two downing CPUs can be in the same batch.
>>>>
>>>> There is a nuance between onlining and offlining:
>>>> -1. onlining relies on qsmaskinitnext, which is stable at the point of
>>>> rcutree_online_cpu(). That is what this patch aims for.
>>>> -2. offlining relies on cpu_dying_mask [2/3] instead of qsmaskinitnext,
>>>> which is not yet cleared at the point of rcutree_offline_cpu().
>>>>
>>>> It is asymmetric owing to the point at which qsmaskinitnext is updated.
>>>>
>>>> Now consider the race between rcu_report_dead() and
>>>> rcutree_offline_cpu():
>>>>
>>>> No matter whether rcutree_offline_cpu() reads out a stale or a fresh
>>>> value of qsmaskinitnext, that value should be intersected with
>>>> cpu_dying_mask.
>>>
>>> Ah I see now, so the race can happen but it is then cancelled by the
>>> stable READ on cpu_dying_mask. Ok.
>>>
>>
>> Maybe I am missing something, but concurrent rcu_report_dead() on CPUs
>> of the same node can corrupt ->qsmaskinitnext, as Frederic showed. That
> 
> There is a pair of lock ops around WRITE_ONCE(rnp->qsmaskinitnext,
> rnp->qsmaskinitnext & ~mask), i.e. raw_spin_lock_irqsave_rcu_node(rnp,
> flags) and raw_spin_unlock_irqrestore_rcu_node(rnp, flags).
> 
> So it should not be a problem, right?

Correct, I think it should not be a problem. I was thrown off a bit by the
comment "Right but you still race against concurrent rcu_report_dead()".

Even with concurrent offlining, a stable ->qsmaskinitnext should ensure that the
hotplugged CPUs are properly considered in rcu_gp_init(), so I don't see a
problem with your patch so far... but I'll continue to dig when you send the
next patch. My fear was that if a bit in ->qsmaskinitnext is not cleared during
an offlining race, then the grace period will hang.
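
(For context, an abridged sketch of the rcu_gp_init() step that consumes
->qsmaskinitnext, loosely based on kernel/rcu/tree.c of this era;
details may differ:)

	rcu_for_each_leaf_node(rnp) {
		/* Buffered online/offline ops are applied under the rnp lock. */
		raw_spin_lock_irq_rcu_node(rnp);
		if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
		    !rnp->wait_blkd_tasks) {
			/* Nothing to do on this leaf rcu_node structure. */
			raw_spin_unlock_irq_rcu_node(rnp);
			continue;
		}
		/* Record old state, apply changes to ->qsmaskinit field. */
		oldmask = rnp->qsmaskinit;
		rnp->qsmaskinit = rnp->qsmaskinitnext;
		...
	}

As long as each individual update of ->qsmaskinitnext is consistent, a
bit cleared late is still picked up by a later grace period; a bit never
cleared, by contrast, could leave the grace period waiting forever,
which is the concern above.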

Thanks,

 - Joel

Patch

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 438ecae6bd7e..ef6d3ae239b9 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1224,7 +1224,7 @@ static void rcu_spawn_one_boost_kthread(struct rcu_node *rnp)
 static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
 {
 	struct task_struct *t = rnp->boost_kthread_task;
-	unsigned long mask = rcu_rnp_online_cpus(rnp);
+	unsigned long mask;
 	cpumask_var_t cm;
 	int cpu;
 
@@ -1233,6 +1233,11 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
 	if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
 		return;
 	mutex_lock(&rnp->boost_kthread_mutex);
+	/*
+	 * Relying on the lock to serialize, so when onlining, the latest
+	 * qsmaskinitnext is for cpus_ptr.
+	 */
+	mask = rcu_rnp_online_cpus(rnp);
 	for_each_leaf_node_possible_cpu(rnp, cpu)
 		if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
 		    cpu != outgoingcpu)