diff mbox series

rcutorture: Traverse possible cpu to set maxcpu in rcu_nocb_toggle()

Message ID 20230824084206.22844-1-qiang.zhang1211@gmail.com (mailing list archive)
State Accepted
Commit ed763051f63059bdb6d728c0890993eab8833feb
Headers show
Series rcutorture: Traverse possible cpu to set maxcpu in rcu_nocb_toggle() | expand

Commit Message

Z qiang Aug. 24, 2023, 8:42 a.m. UTC
Currently, the maxcpu is set by traversing online CPUs, however, if
the rcutorture.onoff_holdoff is set zero and onoff_interval is set
non-zero, and the some CPUs with larger cpuid has been offline before
setting maxcpu, for these CPUs, even if they are online again, also
cannot be offload or deoffload.

This commit therefore use for_each_possible_cpu() instead of
for_each_online_cpu() in rcu_nocb_toggle().

Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
---
 kernel/rcu/rcutorture.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Z qiang Aug. 24, 2023, 8:59 a.m. UTC | #1
Sorry for the repeat sending.


>
> Currently, the maxcpu is set by traversing online CPUs, however, if
> the rcutorture.onoff_holdoff is set zero and onoff_interval is set
> non-zero, and the some CPUs with larger cpuid has been offline before
> setting maxcpu, for these CPUs, even if they are online again, also
> cannot be offload or deoffload.
>
> This commit therefore use for_each_possible_cpu() instead of
> for_each_online_cpu() in rcu_nocb_toggle().
>
> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
> ---
>  kernel/rcu/rcutorture.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index a58372bdf0c1..b75d0fe558ce 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -2131,7 +2131,7 @@ static int rcu_nocb_toggle(void *arg)
>         VERBOSE_TOROUT_STRING("rcu_nocb_toggle task started");
>         while (!rcu_inkernel_boot_has_ended())
>                 schedule_timeout_interruptible(HZ / 10);
> -       for_each_online_cpu(cpu)
> +       for_each_possible_cpu(cpu)
>                 maxcpu = cpu;
>         WARN_ON(maxcpu < 0);
>         if (toggle_interval > ULONG_MAX)
> --
> 2.17.1
>
Paul E. McKenney Aug. 24, 2023, 1:21 p.m. UTC | #2
On Thu, Aug 24, 2023 at 04:42:06PM +0800, Zqiang wrote:
> Currently, the maxcpu is set by traversing online CPUs, however, if
> the rcutorture.onoff_holdoff is set zero and onoff_interval is set
> non-zero, and the some CPUs with larger cpuid has been offline before
> setting maxcpu, for these CPUs, even if they are online again, also
> cannot be offload or deoffload.
> 
> This commit therefore use for_each_possible_cpu() instead of
> for_each_online_cpu() in rcu_nocb_toggle().
> 
> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
> ---
>  kernel/rcu/rcutorture.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index a58372bdf0c1..b75d0fe558ce 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -2131,7 +2131,7 @@ static int rcu_nocb_toggle(void *arg)
>  	VERBOSE_TOROUT_STRING("rcu_nocb_toggle task started");
>  	while (!rcu_inkernel_boot_has_ended())
>  		schedule_timeout_interruptible(HZ / 10);
> -	for_each_online_cpu(cpu)
> +	for_each_possible_cpu(cpu)

Last I checked, bad things could happen if the code attempted to
nocb_toggle a CPU that had not yet come online.  Has that changed?

							Thanx, Paul

>  		maxcpu = cpu;
>  	WARN_ON(maxcpu < 0);
>  	if (toggle_interval > ULONG_MAX)
> -- 
> 2.17.1
>
Z qiang Aug. 25, 2023, 2:28 a.m. UTC | #3
>
> On Thu, Aug 24, 2023 at 04:42:06PM +0800, Zqiang wrote:
> > Currently, the maxcpu is set by traversing online CPUs, however, if
> > the rcutorture.onoff_holdoff is set zero and onoff_interval is set
> > non-zero, and the some CPUs with larger cpuid has been offline before
> > setting maxcpu, for these CPUs, even if they are online again, also
> > cannot be offload or deoffload.
> >
> > This commit therefore use for_each_possible_cpu() instead of
> > for_each_online_cpu() in rcu_nocb_toggle().
> >
> > Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
> > ---
> >  kernel/rcu/rcutorture.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> > index a58372bdf0c1..b75d0fe558ce 100644
> > --- a/kernel/rcu/rcutorture.c
> > +++ b/kernel/rcu/rcutorture.c
> > @@ -2131,7 +2131,7 @@ static int rcu_nocb_toggle(void *arg)
> >       VERBOSE_TOROUT_STRING("rcu_nocb_toggle task started");
> >       while (!rcu_inkernel_boot_has_ended())
> >               schedule_timeout_interruptible(HZ / 10);
> > -     for_each_online_cpu(cpu)
> > +     for_each_possible_cpu(cpu)
>
> Last I checked, bad things could happen if the code attempted to
> nocb_toggle a CPU that had not yet come online.  Has that changed?

For example, there are 8 online CPUs in the system, before we traversing online
CPUs and set maxcpu,  CPU7 has been offline, this causes us to miss nocb_toggle
for CPU7(maxcpu=6)

Even though we still use for_each_online_cpu(), the things described
above also happen.  before we toggle the CPU, this CPU has been offline.


Thanks
Zqiang


>
>                                                         Thanx, Paul
>
> >               maxcpu = cpu;
> >       WARN_ON(maxcpu < 0);
> >       if (toggle_interval > ULONG_MAX)
> > --
> > 2.17.1
> >
Paul E. McKenney Aug. 26, 2023, 1:29 a.m. UTC | #4
On Fri, Aug 25, 2023 at 10:28:37AM +0800, Z qiang wrote:
> >
> > On Thu, Aug 24, 2023 at 04:42:06PM +0800, Zqiang wrote:
> > > Currently, the maxcpu is set by traversing online CPUs, however, if
> > > the rcutorture.onoff_holdoff is set zero and onoff_interval is set
> > > non-zero, and the some CPUs with larger cpuid has been offline before
> > > setting maxcpu, for these CPUs, even if they are online again, also
> > > cannot be offload or deoffload.
> > >
> > > This commit therefore use for_each_possible_cpu() instead of
> > > for_each_online_cpu() in rcu_nocb_toggle().
> > >
> > > Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
> > > ---
> > >  kernel/rcu/rcutorture.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> > > index a58372bdf0c1..b75d0fe558ce 100644
> > > --- a/kernel/rcu/rcutorture.c
> > > +++ b/kernel/rcu/rcutorture.c
> > > @@ -2131,7 +2131,7 @@ static int rcu_nocb_toggle(void *arg)
> > >       VERBOSE_TOROUT_STRING("rcu_nocb_toggle task started");
> > >       while (!rcu_inkernel_boot_has_ended())
> > >               schedule_timeout_interruptible(HZ / 10);
> > > -     for_each_online_cpu(cpu)
> > > +     for_each_possible_cpu(cpu)
> >
> > Last I checked, bad things could happen if the code attempted to
> > nocb_toggle a CPU that had not yet come online.  Has that changed?
> 
> For example, there are 8 online CPUs in the system, before we traversing online
> CPUs and set maxcpu,  CPU7 has been offline, this causes us to miss nocb_toggle
> for CPU7(maxcpu=6)
> 
> Even though we still use for_each_online_cpu(), the things described
> above also happen.  before we toggle the CPU, this CPU has been offline.

Suppose we have a system whose possible CPUs are 0, 1, 2, and 3.  However,
only 0 and 1 are present in this system, and until some manual action is
taken, only 0 and 1 will ever be online.  (Yes, this really can happen!)
In that state, won't toggling CPU 2 and 3 result in failures?

							Thanx, Paul

> Thanks
> Zqiang
> 
> 
> >
> >                                                         Thanx, Paul
> >
> > >               maxcpu = cpu;
> > >       WARN_ON(maxcpu < 0);
> > >       if (toggle_interval > ULONG_MAX)
> > > --
> > > 2.17.1
> > >
Z qiang Aug. 26, 2023, 6:13 a.m. UTC | #5
>
> On Fri, Aug 25, 2023 at 10:28:37AM +0800, Z qiang wrote:
> > >
> > > On Thu, Aug 24, 2023 at 04:42:06PM +0800, Zqiang wrote:
> > > > Currently, the maxcpu is set by traversing online CPUs, however, if
> > > > the rcutorture.onoff_holdoff is set zero and onoff_interval is set
> > > > non-zero, and the some CPUs with larger cpuid has been offline before
> > > > setting maxcpu, for these CPUs, even if they are online again, also
> > > > cannot be offload or deoffload.
> > > >
> > > > This commit therefore use for_each_possible_cpu() instead of
> > > > for_each_online_cpu() in rcu_nocb_toggle().
> > > >
> > > > Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
> > > > ---
> > > >  kernel/rcu/rcutorture.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> > > > index a58372bdf0c1..b75d0fe558ce 100644
> > > > --- a/kernel/rcu/rcutorture.c
> > > > +++ b/kernel/rcu/rcutorture.c
> > > > @@ -2131,7 +2131,7 @@ static int rcu_nocb_toggle(void *arg)
> > > >       VERBOSE_TOROUT_STRING("rcu_nocb_toggle task started");
> > > >       while (!rcu_inkernel_boot_has_ended())
> > > >               schedule_timeout_interruptible(HZ / 10);
> > > > -     for_each_online_cpu(cpu)
> > > > +     for_each_possible_cpu(cpu)
> > >
> > > Last I checked, bad things could happen if the code attempted to
> > > nocb_toggle a CPU that had not yet come online.  Has that changed?
> >
> > For example, there are 8 online CPUs in the system, before we traversing online
> > CPUs and set maxcpu,  CPU7 has been offline, this causes us to miss nocb_toggle
> > for CPU7(maxcpu=6)
> >
> > Even though we still use for_each_online_cpu(), the things described
> > above also happen.  before we toggle the CPU, this CPU has been offline.
>
> Suppose we have a system whose possible CPUs are 0, 1, 2, and 3.  However,
> only 0 and 1 are present in this system, and until some manual action is
> taken, only 0 and 1 will ever be online.  (Yes, this really can happen!)
> In that state, won't toggling CPU 2 and 3 result in failures?
>

Agree.
As long as we enabled rcutorture.onoff_interval,  regardless of whether we use
online CPUs or possible CPUs to set maxcpu,  It is all possible to
toggling the CPUs failure
and print "NOCB: Cannot CB-offload offline CPU" log. but the failures
due to CPU offline are acceptable.

but at least the toggling operation on CPU7 will not be missed. when
CPU7 comes online again.

Would it be better to use for_each_present_cpu() ?

Thanks
Zqiang

>
>                                                         Thanx, Paul
>
> > Thanks
> > Zqiang
> >
> >
> > >
> > >                                                         Thanx, Paul
> > >
> > > >               maxcpu = cpu;
> > > >       WARN_ON(maxcpu < 0);
> > > >       if (toggle_interval > ULONG_MAX)
> > > > --
> > > > 2.17.1
> > > >
Paul E. McKenney Aug. 26, 2023, 1:06 p.m. UTC | #6
On Sat, Aug 26, 2023 at 02:13:39PM +0800, Z qiang wrote:
> >
> > On Fri, Aug 25, 2023 at 10:28:37AM +0800, Z qiang wrote:
> > > >
> > > > On Thu, Aug 24, 2023 at 04:42:06PM +0800, Zqiang wrote:
> > > > > Currently, the maxcpu is set by traversing online CPUs, however, if
> > > > > the rcutorture.onoff_holdoff is set zero and onoff_interval is set
> > > > > non-zero, and the some CPUs with larger cpuid has been offline before
> > > > > setting maxcpu, for these CPUs, even if they are online again, also
> > > > > cannot be offload or deoffload.
> > > > >
> > > > > This commit therefore use for_each_possible_cpu() instead of
> > > > > for_each_online_cpu() in rcu_nocb_toggle().
> > > > >
> > > > > Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
> > > > > ---
> > > > >  kernel/rcu/rcutorture.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> > > > > index a58372bdf0c1..b75d0fe558ce 100644
> > > > > --- a/kernel/rcu/rcutorture.c
> > > > > +++ b/kernel/rcu/rcutorture.c
> > > > > @@ -2131,7 +2131,7 @@ static int rcu_nocb_toggle(void *arg)
> > > > >       VERBOSE_TOROUT_STRING("rcu_nocb_toggle task started");
> > > > >       while (!rcu_inkernel_boot_has_ended())
> > > > >               schedule_timeout_interruptible(HZ / 10);
> > > > > -     for_each_online_cpu(cpu)
> > > > > +     for_each_possible_cpu(cpu)
> > > >
> > > > Last I checked, bad things could happen if the code attempted to
> > > > nocb_toggle a CPU that had not yet come online.  Has that changed?
> > >
> > > For example, there are 8 online CPUs in the system, before we traversing online
> > > CPUs and set maxcpu,  CPU7 has been offline, this causes us to miss nocb_toggle
> > > for CPU7(maxcpu=6)
> > >
> > > Even though we still use for_each_online_cpu(), the things described
> > > above also happen.  before we toggle the CPU, this CPU has been offline.
> >
> > Suppose we have a system whose possible CPUs are 0, 1, 2, and 3.  However,
> > only 0 and 1 are present in this system, and until some manual action is
> > taken, only 0 and 1 will ever be online.  (Yes, this really can happen!)
> > In that state, won't toggling CPU 2 and 3 result in failures?
> >
> 
> Agree.
> As long as we enabled rcutorture.onoff_interval,  regardless of whether we use
> online CPUs or possible CPUs to set maxcpu,  It is all possible to
> toggling the CPUs failure
> and print "NOCB: Cannot CB-offload offline CPU" log. but the failures
> due to CPU offline are acceptable.
> 
> but at least the toggling operation on CPU7 will not be missed. when
> CPU7 comes online again.
> 
> Would it be better to use for_each_present_cpu() ?

The problem we face is that RCU and rcutorture have no reasonable way
of knowing when the boot-time CPU bringup has completed.  If there was a
way of knowing that, then my approach would be to make rcutorture react
to a holdoff of zero by waiting for all the CPUs to come online.

Failing that, for_each_present_cpu() with a holdoff of zero will likely
get us transient failures between the time rcutorture starts and the
last CPU has come online.

Or is there now a way for in-kernel code know when boot-time CPU onlining
has completed?

							Thanx, Paul

> Thanks
> Zqiang
> 
> >
> >                                                         Thanx, Paul
> >
> > > Thanks
> > > Zqiang
> > >
> > >
> > > >
> > > >                                                         Thanx, Paul
> > > >
> > > > >               maxcpu = cpu;
> > > > >       WARN_ON(maxcpu < 0);
> > > > >       if (toggle_interval > ULONG_MAX)
> > > > > --
> > > > > 2.17.1
> > > > >
Frederic Weisbecker Aug. 28, 2023, 3:17 p.m. UTC | #7
Le Sat, Aug 26, 2023 at 06:06:20AM -0700, Paul E. McKenney a écrit :
> On Sat, Aug 26, 2023 at 02:13:39PM +0800, Z qiang wrote:
> > >
> > > On Fri, Aug 25, 2023 at 10:28:37AM +0800, Z qiang wrote:
> > > > >
> > > > > On Thu, Aug 24, 2023 at 04:42:06PM +0800, Zqiang wrote:
> > > > > > Currently, the maxcpu is set by traversing online CPUs, however, if
> > > > > > the rcutorture.onoff_holdoff is set zero and onoff_interval is set
> > > > > > non-zero, and the some CPUs with larger cpuid has been offline before
> > > > > > setting maxcpu, for these CPUs, even if they are online again, also
> > > > > > cannot be offload or deoffload.
> > > > > >
> > > > > > This commit therefore use for_each_possible_cpu() instead of
> > > > > > for_each_online_cpu() in rcu_nocb_toggle().
> > > > > >
> > > > > > Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
> > > > > > ---
> > > > > >  kernel/rcu/rcutorture.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> > > > > > index a58372bdf0c1..b75d0fe558ce 100644
> > > > > > --- a/kernel/rcu/rcutorture.c
> > > > > > +++ b/kernel/rcu/rcutorture.c
> > > > > > @@ -2131,7 +2131,7 @@ static int rcu_nocb_toggle(void *arg)
> > > > > >       VERBOSE_TOROUT_STRING("rcu_nocb_toggle task started");
> > > > > >       while (!rcu_inkernel_boot_has_ended())
> > > > > >               schedule_timeout_interruptible(HZ / 10);
> > > > > > -     for_each_online_cpu(cpu)
> > > > > > +     for_each_possible_cpu(cpu)
> > > > >
> > > > > Last I checked, bad things could happen if the code attempted to
> > > > > nocb_toggle a CPU that had not yet come online.  Has that changed?
> > > >
> > > > For example, there are 8 online CPUs in the system, before we traversing online
> > > > CPUs and set maxcpu,  CPU7 has been offline, this causes us to miss nocb_toggle
> > > > for CPU7(maxcpu=6)
> > > >
> > > > Even though we still use for_each_online_cpu(), the things described
> > > > above also happen.  before we toggle the CPU, this CPU has been offline.
> > >
> > > Suppose we have a system whose possible CPUs are 0, 1, 2, and 3.  However,
> > > only 0 and 1 are present in this system, and until some manual action is
> > > taken, only 0 and 1 will ever be online.  (Yes, this really can happen!)
> > > In that state, won't toggling CPU 2 and 3 result in failures?
> > >
> > 
> > Agree.
> > As long as we enabled rcutorture.onoff_interval,  regardless of whether we use
> > online CPUs or possible CPUs to set maxcpu,  It is all possible to
> > toggling the CPUs failure
> > and print "NOCB: Cannot CB-offload offline CPU" log. but the failures
> > due to CPU offline are acceptable.
> > 
> > but at least the toggling operation on CPU7 will not be missed. when
> > CPU7 comes online again.
> > 
> > Would it be better to use for_each_present_cpu() ?
> 
> The problem we face is that RCU and rcutorture have no reasonable way
> of knowing when the boot-time CPU bringup has completed.  If there was a
> way of knowing that, then my approach would be to make rcutorture react
> to a holdoff of zero by waiting for all the CPUs to come online.
> 
> Failing that, for_each_present_cpu() with a holdoff of zero will likely
> get us transient failures between the time rcutorture starts and the
> last CPU has come online.
> 
> Or is there now a way for in-kernel code know when boot-time CPU onlining
> has completed?

We don't need to wait for all CPUs to be online though. Toggling
already handles well failures due to offline CPUs and since toggling
happens concurrently with offlining in rcutorture, we already see lots
of failures reported in the logs.

Thanks.
Joel Fernandes Aug. 28, 2023, 9:51 p.m. UTC | #8
> On Aug 28, 2023, at 11:17 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> Le Sat, Aug 26, 2023 at 06:06:20AM -0700, Paul E. McKenney a écrit :
>> On Sat, Aug 26, 2023 at 02:13:39PM +0800, Z qiang wrote:
>>>> 
>>>> On Fri, Aug 25, 2023 at 10:28:37AM +0800, Z qiang wrote:
>>>>>> 
>>>>>>> On Thu, Aug 24, 2023 at 04:42:06PM +0800, Zqiang wrote:
>>>>>>>> Currently, the maxcpu is set by traversing online CPUs, however, if
>>>>>>>> the rcutorture.onoff_holdoff is set zero and onoff_interval is set
>>>>>>>> non-zero, and the some CPUs with larger cpuid has been offline before
>>>>>>>> setting maxcpu, for these CPUs, even if they are online again, also
>>>>>>>> cannot be offload or deoffload.
>>>>>>>> 
>>>>>>>> This commit therefore use for_each_possible_cpu() instead of
>>>>>>>> for_each_online_cpu() in rcu_nocb_toggle().
>>>>>>>> 
>>>>>>>> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com>
>>>>>>>> ---
>>>>>>>> kernel/rcu/rcutorture.c | 2 +-
>>>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>> 
>>>>>>>> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
>>>>>>>> index a58372bdf0c1..b75d0fe558ce 100644
>>>>>>>> --- a/kernel/rcu/rcutorture.c
>>>>>>>> +++ b/kernel/rcu/rcutorture.c
>>>>>>>> @@ -2131,7 +2131,7 @@ static int rcu_nocb_toggle(void *arg)
>>>>>>>>      VERBOSE_TOROUT_STRING("rcu_nocb_toggle task started");
>>>>>>>>      while (!rcu_inkernel_boot_has_ended())
>>>>>>>>              schedule_timeout_interruptible(HZ / 10);
>>>>>>>> -     for_each_online_cpu(cpu)
>>>>>>>> +     for_each_possible_cpu(cpu)
>>>>>>> 
>>>>>>> Last I checked, bad things could happen if the code attempted to
>>>>>>> nocb_toggle a CPU that had not yet come online.  Has that changed?
>>>>>> 
>>>>>> For example, there are 8 online CPUs in the system, before we traversing online
>>>>>> CPUs and set maxcpu,  CPU7 has been offline, this causes us to miss nocb_toggle
>>>>>> for CPU7(maxcpu=6)
>>>>>> 
>>>>>> Even though we still use for_each_online_cpu(), the things described
>>>>>> above also happen.  before we toggle the CPU, this CPU has been offline.
>>>>> 
>>>>> Suppose we have a system whose possible CPUs are 0, 1, 2, and 3.  However,
>>>>> only 0 and 1 are present in this system, and until some manual action is
>>>>> taken, only 0 and 1 will ever be online.  (Yes, this really can happen!)
>>>>> In that state, won't toggling CPU 2 and 3 result in failures?
>>>>> 
>>> 
>>> Agree.
>>> As long as we enabled rcutorture.onoff_interval,  regardless of whether we use
>>> online CPUs or possible CPUs to set maxcpu,  It is all possible to
>>> toggling the CPUs failure
>>> and print "NOCB: Cannot CB-offload offline CPU" log. but the failures
>>> due to CPU offline are acceptable.
>>> 
>>> but at least the toggling operation on CPU7 will not be missed. when
>>> CPU7 comes online again.
>>> 
>>> Would it be better to use for_each_present_cpu() ?
>> 
>> The problem we face is that RCU and rcutorture have no reasonable way
>> of knowing when the boot-time CPU bringup has completed.  If there was a
>> way of knowing that, then my approach would be to make rcutorture react
>> to a holdoff of zero by waiting for all the CPUs to come online.
>> 
>> Failing that, for_each_present_cpu() with a holdoff of zero will likely
>> get us transient failures between the time rcutorture starts and the
>> last CPU has come online.
>> 
>> Or is there now a way for in-kernel code know when boot-time CPU onlining
>> has completed?
> 
> We don't need to wait for all CPUs to be online though. Toggling
> already handles well failures due to offline CPUs and since toggling
> happens concurrently with offlining in rcutorture, we already see lots
> of failures reported in the logs.

I think the issue is the loop later in the function does
not try to toggle cpus that came online too late.

So it does not test offloading on all CPUs just because max got updated too late.

One fix could be to periodically check in the loop if a new cpu at maxcpu + 1
ever got onlined. If it did, update the maxcpu.

Thanks.


> 
> Thanks.
Frederic Weisbecker Aug. 29, 2023, 9:24 a.m. UTC | #9
On Mon, Aug 28, 2023 at 05:51:09PM -0400, Joel Fernandes wrote:
> I think the issue is the loop later in the function does
> not try to toggle cpus that came online too late.
> 
> So it does not test offloading on all CPUs just because max got updated too
> late.

Right, and therefore for_each_possible_cpu() or for_each_present_cpu()
should be fine to iterate since it's ok to try to toggle an offline CPU.

> 
> One fix could be to periodically check in the loop if a new cpu at maxcpu + 1
> ever got onlined. If it did, update the maxcpu.

Is it worth the complication though?

Thanks.

> 
> Thanks.
> 
> 
> > 
> > Thanks.
Joel Fernandes Aug. 29, 2023, 10:12 a.m. UTC | #10
> On Aug 29, 2023, at 5:24 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> On Mon, Aug 28, 2023 at 05:51:09PM -0400, Joel Fernandes wrote:
>> I think the issue is the loop later in the function does
>> not try to toggle cpus that came online too late.
>> 
>> So it does not test offloading on all CPUs just because max got updated too
>> late.
> 
> Right, and therefore for_each_possible_cpu() or for_each_present_cpu()
> should be fine to iterate since it's ok to try to toggle an offline CPU.

Ah I see what you mean, sounds good.

> 
>> 
>> One fix could be to periodically check in the loop if a new cpu at maxcpu + 1
>> ever got onlined. If it did, update the maxcpu.
> 
> Is it worth the complication though?

Probably not and so your suggestion sounds fine.

Thanks!

 - Joel


> 
> Thanks.
> 
>> 
>> Thanks.
>> 
>> 
>>> 
>>> Thanks.
Paul E. McKenney Aug. 29, 2023, 12:38 p.m. UTC | #11
On Tue, Aug 29, 2023 at 11:24:24AM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 28, 2023 at 05:51:09PM -0400, Joel Fernandes wrote:
> > I think the issue is the loop later in the function does
> > not try to toggle cpus that came online too late.
> > 
> > So it does not test offloading on all CPUs just because max got updated too
> > late.
> 
> Right, and therefore for_each_possible_cpu() or for_each_present_cpu()
> should be fine to iterate since it's ok to try to toggle an offline CPU.

OK, so I will accept the original patch which did just that.

Thank you!

I recently got burned by lack of workqueues on a not-yet-onlined CPU,
hence my questions here.  ;-)

							Thanx, Paul

> > One fix could be to periodically check in the loop if a new cpu at maxcpu + 1
> > ever got onlined. If it did, update the maxcpu.
> 
> Is it worth the complication though?
> 
> Thanks.
> 
> > 
> > Thanks.
> > 
> > 
> > > 
> > > Thanks.
Paul E. McKenney Sept. 1, 2023, 2:56 p.m. UTC | #12
On Tue, Aug 29, 2023 at 05:38:54AM -0700, Paul E. McKenney wrote:
> On Tue, Aug 29, 2023 at 11:24:24AM +0200, Frederic Weisbecker wrote:
> > On Mon, Aug 28, 2023 at 05:51:09PM -0400, Joel Fernandes wrote:
> > > I think the issue is the loop later in the function does
> > > not try to toggle cpus that came online too late.
> > > 
> > > So it does not test offloading on all CPUs just because max got updated too
> > > late.
> > 
> > Right, and therefore for_each_possible_cpu() or for_each_present_cpu()
> > should be fine to iterate since it's ok to try to toggle an offline CPU.
> 
> OK, so I will accept the original patch which did just that.

Which I finally did.  I took the liberty of adding Frederic's Reviewed-by,
but please let me know if this is in any way inappropriate.

							Thanx, Paul

> Thank you!
> 
> I recently got burned by lack of workqueues on a not-yet-onlined CPU,
> hence my questions here.  ;-)
> 
> 							Thanx, Paul
> 
> > > One fix could be to periodically check in the loop if a new cpu at maxcpu + 1
> > > ever got onlined. If it did, update the maxcpu.
> > 
> > Is it worth the complication though?
> > 
> > Thanks.
> > 
> > > 
> > > Thanks.
> > > 
> > > 
> > > > 
> > > > Thanks.
diff mbox series

Patch

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index a58372bdf0c1..b75d0fe558ce 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -2131,7 +2131,7 @@  static int rcu_nocb_toggle(void *arg)
 	VERBOSE_TOROUT_STRING("rcu_nocb_toggle task started");
 	while (!rcu_inkernel_boot_has_ended())
 		schedule_timeout_interruptible(HZ / 10);
-	for_each_online_cpu(cpu)
+	for_each_possible_cpu(cpu)
 		maxcpu = cpu;
 	WARN_ON(maxcpu < 0);
 	if (toggle_interval > ULONG_MAX)