
[RFC,01/49] xen/sched: call cpu_disable_scheduler() via cpu notifier

Message ID 20190329150934.17694-2-jgross@suse.com (mailing list archive)
State Superseded
Series xen: add core scheduling support

Commit Message

Jürgen Groß March 29, 2019, 3:08 p.m. UTC
cpu_disable_scheduler() is being called from __cpu_disable() today.
There is no need to execute it on the cpu just being disabled, so use
the CPU_DEAD case of the cpu notifier chain. Moving the call out of
stop_machine() context is fine, as we just need to hold the domain RCU
lock and need the scheduler percpu data to be still allocated.

Add another hook for CPU_DOWN_PREPARE to bail out early in case
cpu_disable_scheduler() would fail. This will avoid crashes in rare
cases for cpu hotplug or suspend.

While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
incarnation.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V2:
- add CPU_DOWN_PREPARE hook
- BUG() in case of cpu_disable_scheduler() failing in CPU_DEAD
  (Jan Beulich)
- modify ARM __cpu_disable(), too (Andrew Cooper)
---
 xen/arch/arm/smpboot.c |  4 ----
 xen/arch/x86/smpboot.c |  3 ---
 xen/common/schedule.c  | 42 +++++++++++++++++++++++++++++++++++-------
 3 files changed, 35 insertions(+), 14 deletions(-)

Comments

Julien Grall April 1, 2019, 9:21 a.m. UTC | #1
Hi,

On 3/29/19 3:08 PM, Juergen Gross wrote:
> cpu_disable_scheduler() is being called from __cpu_disable() today.
> There is no need to execute it on the cpu just being disabled, so use
> the CPU_DEAD case of the cpu notifier chain. Moving the call out of
> stop_machine() context is fine, as we just need to hold the domain RCU
> lock and need the scheduler percpu data to be still allocated.
> 
> Add another hook for CPU_DOWN_PREPARE to bail out early in case
> cpu_disable_scheduler() would fail. This will avoid crashes in rare
> cases for cpu hotplug or suspend.
> 
> While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
> incarnation.

It is not obvious why the smp_mb() is superfluous. Can you please
provide more details on why this is not necessary?

Cheers,
Jürgen Groß April 1, 2019, 9:40 a.m. UTC | #2
On 01/04/2019 11:21, Julien Grall wrote:
> Hi,
> 
> On 3/29/19 3:08 PM, Juergen Gross wrote:
>> [...]
>> While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
>> incarnation.
> 
> It is not obvious why the smp_mb() is superfluous. Can you please
> provide more details on why this is not necessary?

cpumask_clear_cpu() should already have the needed semantics, no?
It is based on clear_bit() which is defined to be atomic.


Juergen
Julien Grall April 1, 2019, 10:29 a.m. UTC | #3
Hi,

On 4/1/19 10:40 AM, Juergen Gross wrote:
> On 01/04/2019 11:21, Julien Grall wrote:
>> Hi,
>>
>> On 3/29/19 3:08 PM, Juergen Gross wrote:
>>> [...]
>>> While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
>>> incarnation.
>>
>> It is not obvious why the smp_mb() is superfluous. Can you please
>> provide more details on why this is not necessary?
> 
> cpumask_clear_cpu() should already have the needed semantics, no?
> It is based on clear_bit() which is defined to be atomic.

Atomicity does not mean the store/load cannot be re-ordered by the CPU.
You would need a barrier to prevent re-ordering.

cpumask_clear_cpu() and clear_bit() do not contain any barrier, so
store/load can be re-ordered.
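
To illustrate the distinction, here is a minimal sketch in plain C (not
the Xen implementation; relaxed_clear_bit() is a made-up name):

/*
 * Sketch: an atomic read-modify-write with relaxed ordering. No CPU can
 * observe a torn update, yet the compiler and the CPU remain free to move
 * independent loads/stores across it -- atomicity and ordering are
 * separate properties.
 */
#include <limits.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

static inline void relaxed_clear_bit(unsigned int nr, unsigned long *addr)
{
    unsigned long mask = 1UL << (nr % BITS_PER_LONG);

    /* Atomic, but __ATOMIC_RELAXED imposes no ordering whatsoever. */
    __atomic_fetch_and(&addr[nr / BITS_PER_LONG], ~mask, __ATOMIC_RELAXED);
}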

I see we have a similar smp_mb() barrier in __cpu_die(). Sadly, there is
no documentation in the code explaining why the barrier is there. The
logs don't help either.

The barrier here will ensure that the loads/stores related to disabling
the CPU are seen before any load/store happening after the return,
although I am not sure why this is necessary.

Stefano, do you remember the rationale?

Cheers,
Jürgen Groß April 1, 2019, 10:37 a.m. UTC | #4
On 01/04/2019 12:29, Julien Grall wrote:
> Hi,
> 
> On 4/1/19 10:40 AM, Juergen Gross wrote:
>> On 01/04/2019 11:21, Julien Grall wrote:
>>> Hi,
>>>
>>> On 3/29/19 3:08 PM, Juergen Gross wrote:
>>>> [...]
>>>> While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
>>>> incarnation.
>>>
>>> It is not obvious why the smp_mb() is superfluous. Can you please
>>> provide more details on why this is not necessary?
>>
>> cpumask_clear_cpu() should already have the needed semantics, no?
>> It is based on clear_bit() which is defined to be atomic.
> 
> Atomicity does not mean the store/load cannot be re-ordered by the CPU.
> You would need a barrier to prevent re-ordering.
> 
> cpumask_clear_cpu() and clear_bit() do not contain any barrier, so
> store/load can be re-ordered.

Uh, couldn't this lead to problems, e.g. in vcpu_block()? The comment
there suggests the sequence of setting the blocked bit and doing the
test is important for avoiding a race...
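
The pattern I mean looks roughly like this (a paraphrase of vcpu_block(),
not the verbatim code):

    set_bit(_VPF_blocked, &v->pause_flags);

    /* Check for events /after/ blocking: avoids wakeup waiting race. */
    if ( local_events_need_delivery() )
        clear_bit(_VPF_blocked, &v->pause_flags);   /* undo: event pending */
    else
        raise_softirq(SCHEDULE_SOFTIRQ);            /* really go to sleep */

If the store from set_bit() and the loads behind
local_events_need_delivery() can be reordered, the test may observe a
stale "no event" state and the vCPU blocks despite a pending event.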


Juergen

Julien Grall April 1, 2019, 1:21 p.m. UTC | #5
Hi Juergen,

On 4/1/19 11:37 AM, Juergen Gross wrote:
> On 01/04/2019 12:29, Julien Grall wrote:
>> Hi,
>>
>> On 4/1/19 10:40 AM, Juergen Gross wrote:
>>> On 01/04/2019 11:21, Julien Grall wrote:
>>>> Hi,
>>>>
>>>> On 3/29/19 3:08 PM, Juergen Gross wrote:
>>>>> [...]
>>>>> While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
>>>>> incarnation.
>>>>
>>>> It is not obvious why the smp_mb() is superfluous. Can you please
>>>> provide more details on why this is not necessary?
>>>
>>> cpumask_clear_cpu() should already have the needed semantics, no?
>>> It is based on clear_bit() which is defined to be atomic.
>>
>> Atomicity does not mean the store/load cannot be re-ordered by the CPU.
>> You would need a barrier to prevent re-ordering.
>>
>> cpumask_clear_cpu() and clear_bit() do not contain any barrier, so
>> store/load can be re-ordered.
> 
> Uh, couldn't this lead to problems, e.g. in vcpu_block()? The comment
> there suggests the sequence of setting the blocked bit and doing the
> test is important for avoiding a race...

Hmmm... looking at the other usages (such as in do_poll), on non-x86
platforms there is an smp_mb() between set_bit(...) and checking the
event, with a similar comment above.

I don't know the scheduler code well enough to know why the barrier is
needed. But for consistency, it seems to me the smp_mb() would be
required in vcpu_block() as well.

Also, it is quite interesting that the barrier is not present for x86.
If I understand correctly the comment on top of set_bit/clear_bit, it
could just as well be re-ordered. So we seem to be relying on the
underlying implementation of set_bit/clear_bit.
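
The do_poll() pattern I am referring to looks roughly like this
(simplified sketch, details elided):

    set_bit(_VPF_blocked, &v->pause_flags);

#ifndef CONFIG_X86 /* set_bit() implies mb() on x86 */
    smp_mb();       /* order the store above against the event test below */
#endif

    if ( local_events_need_delivery() )
        goto out;   /* an event is already pending: do not block */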

Wouldn't it make sense to try to uniformize the semantics? Maybe by 
introducing a new helper?

Cheers,
Jürgen Groß April 1, 2019, 1:33 p.m. UTC | #6
On 01/04/2019 15:21, Julien Grall wrote:
> Hi Juergen,
> 
> On 4/1/19 11:37 AM, Juergen Gross wrote:
>> On 01/04/2019 12:29, Julien Grall wrote:
>>> Hi,
>>>
>>> On 4/1/19 10:40 AM, Juergen Gross wrote:
>>>> On 01/04/2019 11:21, Julien Grall wrote:
>>>>> Hi,
>>>>>
>>>>> On 3/29/19 3:08 PM, Juergen Gross wrote:
>>>>>> [...]
>>>>>> While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
>>>>>> incarnation.
>>>>>
>>>>> It is not obvious why the smp_mb() is superfluous. Can you please
>>>>> provide more details on why this is not necessary?
>>>>
>>>> cpumask_clear_cpu() should already have the needed semantics, no?
>>>> It is based on clear_bit() which is defined to be atomic.
>>>
>>> Atomicity does not mean the store/load cannot be re-ordered by the CPU.
>>> You would need a barrier to prevent re-ordering.
>>>
>>> cpumask_clear_cpu() and clear_bit() do not contain any barrier, so
>>> store/load can be re-ordered.
>>
>> Uh, couldn't this lead to problems, e.g. in vcpu_block()? The comment
>> there suggests the sequence of setting the blocked bit and doing the
>> test is important for avoiding a race...
> 
> Hmmm... looking at the other usages (such as in do_poll), on non-x86
> platforms there is an smp_mb() between set_bit(...) and checking the
> event, with a similar comment above.
> 
> I don't know the scheduler code well enough to know why the barrier is
> needed. But for consistency, it seems to me the smp_mb() would be
> required in vcpu_block() as well.
> 
> Also, it is quite interesting that the barrier is not present for x86.
> If I understand correctly the comment on top of set_bit/clear_bit, it
> could just as well be re-ordered. So we seem to be relying on the
> underlying implementation of set_bit/clear_bit.

On x86 reads and writes can't be reordered with locked operations (SDM
Vol 3 8.2.2). So the barrier is really not needed AFAIU.

include/asm-x86/bitops.h:

 * clear_bit() is atomic and may not be reordered.
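
For reference, the x86 variant boils down to a locked instruction; a
simplified sketch of what include/asm-x86/bitops.h does (clear_bit_sketch
is a stand-in name):

static inline void clear_bit_sketch(int nr, volatile void *addr)
{
    asm volatile ( "lock; btrl %1,%0"
                   : "+m" (*(volatile int *)addr)
                   : "Ir" (nr)
                   : "memory" );
}

The lock prefix keeps the read-modify-write atomic and, per SDM Vol 3
8.2.2, ordered against surrounding reads and writes, i.e. it acts as a
full barrier on x86.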

> Wouldn't it make sense to try to uniformize the semantics? Maybe by
> introducing a new helper?

Or adding the barrier on ARM for the atomic operations?


Juergen
Julien Grall April 1, 2019, 2:01 p.m. UTC | #7
Hi,

On 4/1/19 2:33 PM, Juergen Gross wrote:
> On 01/04/2019 15:21, Julien Grall wrote:
>> Hi Juergen,
>>
>> On 4/1/19 11:37 AM, Juergen Gross wrote:
>>> On 01/04/2019 12:29, Julien Grall wrote:
>>>> Hi,
>>>>
>>>> On 4/1/19 10:40 AM, Juergen Gross wrote:
>>>>> On 01/04/2019 11:21, Julien Grall wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 3/29/19 3:08 PM, Juergen Gross wrote:
>>>>>>> [...]
>>>>>>> While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
>>>>>>> incarnation.
>>>>>>
>>>>>> It is not obvious why the smp_mb() is superfluous. Can you please
>>>>>> provide more details on why this is not necessary?
>>>>>
>>>>> cpumask_clear_cpu() should already have the needed semantics, no?
>>>>> It is based on clear_bit() which is defined to be atomic.
>>>>
>>>> Atomicity does not mean the store/load cannot be re-ordered by the CPU.
>>>> You would need a barrier to prevent re-ordering.
>>>>
>>>> cpumask_clear_cpu() and clear_bit() do not contain any barrier, so
>>>> store/load can be re-ordered.
>>>
>>> Uh, couldn't this lead to problems, e.g. in vcpu_block()? The comment
>>> there suggests the sequence of setting the blocked bit and doing the
>>> test is important for avoiding a race...
>>
>> Hmmm... looking at the other usages (such as in do_poll), on non-x86
>> platforms there is an smp_mb() between set_bit(...) and checking the
>> event, with a similar comment above.
>>
>> I don't know the scheduler code well enough to know why the barrier is
>> needed. But for consistency, it seems to me the smp_mb() would be
>> required in vcpu_block() as well.
>>
>> Also, it is quite interesting that the barrier is not present for x86.
>> If I understand correctly the comment on top of set_bit/clear_bit, it
>> could just as well be re-ordered. So we seem to be relying on the
>> underlying implementation of set_bit/clear_bit.
> 
> On x86 reads and writes can't be reordered with locked operations (SDM
> Vol 3 8.2.2). So the barrier is really not needed AFAIU.
> 
> include/asm-x86/bitops.h:
> 
>   * clear_bit() is atomic and may not be reordered.

I interpreted the "may not" as meaning you should not rely on the
re-ordering not happening.

In places where re-ordering must not happen (e.g. test_and_set_bit) we
use the wording "cannot".

> 
>> Wouldn't it make sense to try to uniformize the semantics? Maybe by
>> introducing a new helper?
> 
> Or adding the barrier on ARM for the atomic operations?

On what basis? Why should we impact every user to fix a bug in
the scheduler?

Cheers,
Jürgen Groß April 1, 2019, 2:23 p.m. UTC | #8
On 01/04/2019 16:01, Julien Grall wrote:
> Hi,
> 
> On 4/1/19 2:33 PM, Juergen Gross wrote:
>> On 01/04/2019 15:21, Julien Grall wrote:
>>> Hi Juergen,
>>>
>>> On 4/1/19 11:37 AM, Juergen Gross wrote:
>>>> On 01/04/2019 12:29, Julien Grall wrote:
>>>>> Hi,
>>>>>
>>>>> On 4/1/19 10:40 AM, Juergen Gross wrote:
>>>>>> On 01/04/2019 11:21, Julien Grall wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 3/29/19 3:08 PM, Juergen Gross wrote:
>>>>>>>> [...]
>>>>>>>> While at it remove a superfluous smp_mb() in the ARM
>>>>>>>> __cpu_disable()
>>>>>>>> incarnation.
>>>>>>>
>>>>>>> It is not obvious why the smp_mb() is superfluous. Can you please
>>>>>>> provide more details on why this is not necessary?
>>>>>>
>>>>>> cpumask_clear_cpu() should already have the needed semantics, no?
>>>>>> It is based on clear_bit() which is defined to be atomic.
>>>>>
>>>>> Atomicity does not mean the store/load cannot be re-ordered by the
>>>>> CPU.
>>>>> You would need a barrier to prevent re-ordering.
>>>>>
>>>>> cpumask_clear_cpu() and clear_bit() do not contain any barrier, so
>>>>> store/load can be re-ordered.
>>>>
>>>> Uh, couldn't this lead to problems, e.g. in vcpu_block()? The comment
>>>> there suggests the sequence of setting the blocked bit and doing the
>>>> test is important for avoiding a race...
>>>
>>> Hmmm... looking at the other usages (such as in do_poll), on non-x86
>>> platforms there is an smp_mb() between set_bit(...) and checking the
>>> event, with a similar comment above.
>>>
>>> I don't know the scheduler code well enough to know why the barrier is
>>> needed. But for consistency, it seems to me the smp_mb() would be
>>> required in vcpu_block() as well.
>>>
>>> Also, it is quite interesting that the barrier is not present for x86.
>>> If I understand correctly the comment on top of set_bit/clear_bit, it
>>> could just as well be re-ordered. So we seem to be relying on the
>>> underlying implementation of set_bit/clear_bit.
>>
>> On x86 reads and writes can't be reordered with locked operations (SDM
>> Vol 3 8.2.2). So the barrier is really not needed AFAIU.
>>
>> include/asm-x86/bitops.h:
>>
>>   * clear_bit() is atomic and may not be reordered.
> 
> I interpreted the "may not" as meaning you should not rely on the
> re-ordering not happening.
> 
> In places where re-ordering must not happen (e.g. test_and_set_bit) we
> use the wording "cannot".

The SDM is very clear here:

"Reads or writes cannot be reordered with I/O instructions, locked
 instructions, or serializing instructions."

>>> Wouldn't it make sense to try to uniformize the semantics? Maybe by
>>> introducing a new helper?
>>
>> Or adding the barrier on ARM for the atomic operations?
> 
> On what basis? Why should we impact every user to fix a bug in
> the scheduler?

I'm assuming there are more places like this either in common code or
code copied verbatim from arch/x86 to arch/arm with that problem.

So I take it you'd rather let me add that smp_mb() in __cpu_disable()
again.


Juergen
Julien Grall April 1, 2019, 3:15 p.m. UTC | #9
Hi,

On 4/1/19 3:23 PM, Juergen Gross wrote:
> On 01/04/2019 16:01, Julien Grall wrote:
>> Hi,
>>
>> On 4/1/19 2:33 PM, Juergen Gross wrote:
>>> On 01/04/2019 15:21, Julien Grall wrote:
>>>> Hi Juergen,
>>>>
>>>> On 4/1/19 11:37 AM, Juergen Gross wrote:
>>>>> On 01/04/2019 12:29, Julien Grall wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 4/1/19 10:40 AM, Juergen Gross wrote:
>>>>>>> On 01/04/2019 11:21, Julien Grall wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 3/29/19 3:08 PM, Juergen Gross wrote:
>>>>>>>>> [...]
>>>>>>>>> While at it remove a superfluous smp_mb() in the ARM
>>>>>>>>> __cpu_disable()
>>>>>>>>> incarnation.
>>>>>>>>
>>>>>>>> It is not obvious why the smp_mb() is superfluous. Can you please
>>>>>>>> provide more details on why this is not necessary?
>>>>>>>
>>>>>>> cpumask_clear_cpu() should already have the needed semantics, no?
>>>>>>> It is based on clear_bit() which is defined to be atomic.
>>>>>>
>>>>>> Atomicity does not mean the store/load cannot be re-ordered by the
>>>>>> CPU.
>>>>>> You would need a barrier to prevent re-ordering.
>>>>>>
>>>>>> cpumask_clear_cpu() and clear_bit() do not contain any barrier, so
>>>>>> store/load can be re-ordered.
>>>>>
>>>>> Uh, couldn't this lead to problems, e.g. in vcpu_block()? The comment
>>>>> there suggests the sequence of setting the blocked bit and doing the
>>>>> test is important for avoiding a race...
>>>>
>>>> Hmmm... looking at the other usages (such as in do_poll), on non-x86
>>>> platforms there is an smp_mb() between set_bit(...) and checking the
>>>> event, with a similar comment above.
>>>>
>>>> I don't know the scheduler code well enough to know why the barrier is
>>>> needed. But for consistency, it seems to me the smp_mb() would be
>>>> required in vcpu_block() as well.
>>>>
>>>> Also, it is quite interesting that the barrier is not present for x86.
>>>> If I understand correctly the comment on top of set_bit/clear_bit, it
>>>> could just as well be re-ordered. So we seem to be relying on the
>>>> underlying implementation of set_bit/clear_bit.
>>>
>>> On x86 reads and writes can't be reordered with locked operations (SDM
>>> Vol 3 8.2.2). So the barrier is really not needed AFAIU.
>>>
>>> include/asm-x86/bitops.h:
>>>
>>>    * clear_bit() is atomic and may not be reordered.
>>
>> I interpreted the "may not" as meaning you should not rely on the
>> re-ordering not happening.
>>
>> In places where re-ordering must not happen (e.g. test_and_set_bit) we
>> use the wording "cannot".
> 
> The SDM is very clear here:
> 
> "Reads or writes cannot be reordered with I/O instructions, locked
>   instructions, or serializing instructions."

This is what the specification says, not the intended semantics. A
helper may have more relaxed semantics to accommodate other
architectures.

I believe this is the case here. The semantics are more relaxed than the
implementation, so you don't have to impose a barrier on architectures
with a more relaxed memory ordering.

> 
>>>> Wouldn't it make sense to try to uniformize the semantics? Maybe by
>>>> introducing a new helper?
>>>
>>> Or adding the barrier on ARM for the atomic operations?
>>
>> On what basis? Why should we impact every user to fix a bug in
>> the scheduler?
> 
> I'm assuming there are more places like this either in common code or
> code copied verbatim from arch/x86 to arch/arm with that problem.

Adding it in the *_set helpers is just the poor man's fix. If we do
that, it is going to stick around for a long time and impact performance.

Instead we should fix the scheduler code (and hopefully only that) where
the ordering is necessary.
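
Such a fix might be as small as this hypothetical sketch, mirroring the
do_poll() pattern:

    /* Hypothetical change in vcpu_block(): */
    set_bit(_VPF_blocked, &v->pause_flags);

    smp_mb();   /* make the blocked bit visible before testing for events */

    if ( local_events_need_delivery() )
        clear_bit(_VPF_blocked, &v->pause_flags);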

> 
> So I take it you'd rather let me add that smp_mb() in __cpu_disable()
> again.

Removing/adding barriers should be accompanied by a proper
justification in the commit message. Additionally, new barriers should
have a comment explaining what they are for.

In this case, I don't know what the correct answer is. It feels to me we
should keep it until we have a better understanding of this code. But
then it raises the question whether a barrier would also be necessary
after calling cpu_disable_scheduler().

Cheers,
Jürgen Groß April 1, 2019, 4 p.m. UTC | #10
On 01/04/2019 17:15, Julien Grall wrote:
> Hi,
> 
> On 4/1/19 3:23 PM, Juergen Gross wrote:
>> On 01/04/2019 16:01, Julien Grall wrote:
>>> Hi,
>>>
>>> On 4/1/19 2:33 PM, Juergen Gross wrote:
>>>> On 01/04/2019 15:21, Julien Grall wrote:
>>>>> Hi Juergen,
>>>>>
>>>>> On 4/1/19 11:37 AM, Juergen Gross wrote:
>>>>>> On 01/04/2019 12:29, Julien Grall wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 4/1/19 10:40 AM, Juergen Gross wrote:
>>>>>>>> On 01/04/2019 11:21, Julien Grall wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 3/29/19 3:08 PM, Juergen Gross wrote:
>>>>>>>>>> [...]
>>>>>>>>>> While at it remove a superfluous smp_mb() in the ARM
>>>>>>>>>> __cpu_disable()
>>>>>>>>>> incarnation.
>>>>>>>>>
>>>>>>>>> It is not obvious why the smp_mb() is superfluous. Can you please
>>>>>>>>> provide more details on why this is not necessary?
>>>>>>>>
>>>>>>>> cpumask_clear_cpu() should already have the needed semantics, no?
>>>>>>>> It is based on clear_bit() which is defined to be atomic.
>>>>>>>
>>>>>>> Atomicity does not mean the store/load cannot be re-ordered by the
>>>>>>> CPU.
>>>>>>> You would need a barrier to prevent re-ordering.
>>>>>>>
>>>>>>> cpumask_clear_cpu() and clear_bit() do not contain any barrier, so
>>>>>>> store/load can be re-ordered.
>>>>>>
>>>>>> Uh, couldn't this lead to problems, e.g. in vcpu_block()? The comment
>>>>>> there suggests the sequence of setting the blocked bit and doing the
>>>>>> test is important for avoiding a race...
>>>>>
>>>>> Hmmm... looking at the other usages (such as in do_poll), on non-x86
>>>>> platforms there is an smp_mb() between set_bit(...) and checking the
>>>>> event, with a similar comment above.
>>>>>
>>>>> I don't know the scheduler code well enough to know why the barrier is
>>>>> needed. But for consistency, it seems to me the smp_mb() would be
>>>>> required in vcpu_block() as well.
>>>>>
>>>>> Also, it is quite interesting that the barrier is not present for x86.
>>>>> If I understand correctly the comment on top of set_bit/clear_bit, it
>>>>> could just as well be re-ordered. So we seem to be relying on the
>>>>> underlying implementation of set_bit/clear_bit.
>>>>
>>>> On x86 reads and writes can't be reordered with locked operations (SDM
>>>> Vol 3 8.2.2). So the barrier is really not needed AFAIU.
>>>>
>>>> include/asm-x86/bitops.h:
>>>>
>>>>    * clear_bit() is atomic and may not be reordered.
>>>
>>> I interpreted the "may not" as meaning you should not rely on the
>>> re-ordering not happening.
>>>
>>> In places where re-ordering must not happen (e.g. test_and_set_bit) we
>>> use the wording "cannot".
>>
>> The SDM is very clear here:
>>
>> "Reads or writes cannot be reordered with I/O instructions, locked
>>   instructions, or serializing instructions."
> 
> This is what the specification says, not the intended semantics. A
> helper may have more relaxed semantics to accommodate other
> architectures.
> 
> I believe this is the case here. The semantics are more relaxed than the
> implementation, so you don't have to impose a barrier on architectures
> with a more relaxed memory ordering.
> 
>>
>>>>> Wouldn't it make sense to try to uniformize the semantics? Maybe by
>>>>> introducing a new helper?
>>>>
>>>> Or adding the barrier on ARM for the atomic operations?
>>>
>>> On what basis? Why should we impact every user to fix a bug in
>>> the scheduler?
>>
>> I'm assuming there are more places like this either in common code or
>> code copied verbatim from arch/x86 to arch/arm with that problem.
> 
> Adding it in the *_set helpers is just the poor man's fix. If we do
> that, it is going to stick around for a long time and impact performance.
> 
> Instead we should fix the scheduler code (and hopefully only that) where
> the ordering is necessary.

I believe that should be a patch on its own. Are you doing that?

>> So I take it you'd rather let me add that smp_mb() in __cpu_disable()
>> again.
> 
> Removing/adding barriers should be accompanied by a proper
> justification in the commit message. Additionally, new barriers should
> have a comment explaining what they are for.
> 
> In this case, I don't know what the correct answer is. It feels to me we
> should keep it until we have a better understanding of this code. But

Okay.

> then it raises the question whether a barrier would also be necessary
> after calling cpu_disable_scheduler().

That one is quite easy: all paths of cpu_disable_scheduler() are doing
an unlock operation at the end, so the barrier is already there.
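
Schematically (a generic sketch with lock and flags as placeholder names,
not the exact code):

    spin_lock_irqsave(&lock, flags);
    /* ... move or flag the vCPUs still on the dying CPU ... */
    spin_unlock_irqrestore(&lock, flags);

The unlock has release semantics: all loads/stores before it are made
visible before the lock is seen as free, so no extra smp_mb() is needed
after the call.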


Juergen
Julien Grall April 1, 2019, 5:17 p.m. UTC | #11
Hi,

On 4/1/19 5:00 PM, Juergen Gross wrote:
> On 01/04/2019 17:15, Julien Grall wrote:
>> Hi,
>>
>> On 4/1/19 3:23 PM, Juergen Gross wrote:
>>> On 01/04/2019 16:01, Julien Grall wrote:
>>>> Hi,
>>>>
>>>> On 4/1/19 2:33 PM, Juergen Gross wrote:
>>>>> On 01/04/2019 15:21, Julien Grall wrote:
>>>>>> Hi Juergen,
>>>>>>
>>>>>> On 4/1/19 11:37 AM, Juergen Gross wrote:
>>>>>>> On 01/04/2019 12:29, Julien Grall wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 4/1/19 10:40 AM, Juergen Gross wrote:
>>>>>>>>> On 01/04/2019 11:21, Julien Grall wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 3/29/19 3:08 PM, Juergen Gross wrote:
>>>>>>>>>>> [...]
>>>>>>>>>>> While at it remove a superfluous smp_mb() in the ARM
>>>>>>>>>>> __cpu_disable()
>>>>>>>>>>> incarnation.
>>>>>>>>>>
>>>>>>>>>> It is not obvious why the smp_mb() is superfluous. Can you please
>>>>>>>>>> provide more details on why this is not necessary?
>>>>>>>>>
>>>>>>>>> cpumask_clear_cpu() should already have the needed semantics, no?
>>>>>>>>> It is based on clear_bit() which is defined to be atomic.
>>>>>>>>
>>>>>>>> Atomicity does not mean the store/load cannot be re-ordered by the
>>>>>>>> CPU.
>>>>>>>> You would need a barrier to prevent re-ordering.
>>>>>>>>
>>>>>>>> cpumask_clear_cpu() and clear_bit() do not contain any barrier, so
>>>>>>>> store/load can be re-ordered.
>>>>>>>
>>>>>>> Uh, couldn't this lead to problems, e.g. in vcpu_block()? The comment
>>>>>>> there suggests the sequence of setting the blocked bit and doing the
>>>>>>> test is important for avoiding a race...
>>>>>>
>>>>>> Hmmm... looking at the other usages (such as in do_poll), on non-x86
>>>>>> platforms there is an smp_mb() between set_bit(...) and checking the
>>>>>> event, with a similar comment above.
>>>>>>
>>>>>> I don't know the scheduler code well enough to know why the barrier is
>>>>>> needed. But for consistency, it seems to me the smp_mb() would be
>>>>>> required in vcpu_block() as well.
>>>>>>
>>>>>> Also, it is quite interesting that the barrier is not present for x86.
>>>>>> If I understand correctly the comment on top of set_bit/clear_bit, it
>>>>>> could just as well be re-ordered. So we seem to be relying on the
>>>>>> underlying implementation of set_bit/clear_bit.
>>>>>
>>>>> On x86 reads and writes can't be reordered with locked operations (SDM
>>>>> Vol 3 8.2.2). So the barrier is really not needed AFAIU.
>>>>>
>>>>> include/asm-x86/bitops.h:
>>>>>
>>>>>     * clear_bit() is atomic and may not be reordered.
>>>>
>>>> I interpreted the "may not" as meaning you should not rely on the
>>>> re-ordering not happening.
>>>>
>>>> In places where re-ordering must not happen (e.g. test_and_set_bit) we
>>>> use the wording "cannot".
>>>
>>> The SDM is very clear here:
>>>
>>> "Reads or writes cannot be reordered with I/O instructions, locked
>>>    instructions, or serializing instructions."
>>
>> This is what the specification says, not the intended semantics. A
>> helper may have more relaxed semantics to accommodate other
>> architectures.
>>
>> I believe this is the case here. The semantics are more relaxed than the
>> implementation, so you don't have to impose a barrier on architectures
>> with a more relaxed memory ordering.
>>
>>>
>>>>>> Wouldn't it make sense to try to uniformize the semantics? Maybe by
>>>>>> introducing a new helper?
>>>>>
>>>>> Or adding the barrier on ARM for the atomic operations?
>>>>
>>>> On what basis? Why should we impact every user to fix a bug in
>>>> the scheduler?
>>>
>>> I'm assuming there are more places like this either in common code or
>>> code copied verbatim from arch/x86 to arch/arm with that problem.
>>
>> Adding it in the *_set helpers is just the poor man's fix. If we do
>> that, it is going to stick around for a long time and impact performance.
>>
>> Instead we should fix the scheduler code (and hopefully only that) where
>> the ordering is necessary.
> 
> I believe that should be a patch on its own. Are you doing that?

I will try to have a look tomorrow.

> 
>>> So I take it you'd rather let me add that smp_mb() in __cpu_disable()
>>> again.
>>
>> Removing/adding barriers should be accompanied by a proper
>> justification in the commit message. Additionally, new barriers should
>> have a comment explaining what they are for.
>>
>> In this case, I don't know what the correct answer is. It feels to me we
>> should keep it until we have a better understanding of this code. But
> 
> Okay.
> 
>> then it raises the question whether a barrier would also be necessary
>> after calling cpu_disable_scheduler().
> 
> That one is quite easy: all paths of cpu_disable_scheduler() are doing
> an unlock operation at the end, so the barrier is already there.

Oh, nothing to worry about then :). Thank you for looking at it.

Cheers,
Stefano Stabellini April 16, 2019, 7:34 p.m. UTC | #12
On Mon, 1 Apr 2019, Julien Grall wrote:
> Hi,
> 
> On 4/1/19 10:40 AM, Juergen Gross wrote:
> > On 01/04/2019 11:21, Julien Grall wrote:
> > > Hi,
> > > 
> > > On 3/29/19 3:08 PM, Juergen Gross wrote:
> > > > [...]
> > > > While at it remove a superfluous smp_mb() in the ARM __cpu_disable()
> > > > incarnation.
> > > 
> > > It is not obvious why the smp_mb() is superfluous. Can you please
> > > provide more details on why this is not necessary?
> > 
> > cpumask_clear_cpu() should already have the needed semantics, no?
> > It is based on clear_bit() which is defined to be atomic.
> 
> Atomicity does not mean the store/load cannot be re-ordered by the CPU. You
> would need a barrier to prevent re-ordering.
> 
> cpumask_clear_cpu() and clear_bit() do not contain any barrier, so
> store/load can be re-ordered.
> 
> I see we have a similar smp_mb() barrier in __cpu_die(). Sadly, there is no
> documentation in the code explaining why the barrier is there. The logs don't
> help either.
> 
> The barrier here will ensure that the loads/stores related to disabling the
> CPU are seen before any load/store happening after the return, although I am
> not sure why this is necessary.
> 
> Stefano, do you remember the rationale?

/me doing some archeology

I am pretty sure it was meant to accompany the cpumask_clear_cpu call. I
think we should keep it in __cpu_disable right after cpumask_clear_cpu.
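
I.e. something like this sketch of the ARM __cpu_disable(), with the
barrier retained (the comment is a guess at the rationale):

    /* It's now safe to remove this processor from the online map */
    cpumask_clear_cpu(cpu, &cpu_online_map);
    smp_mb();   /* make the updated online map visible to the other CPUs */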

Patch

diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
index 25cd44549c..0728a9b505 100644
--- a/xen/arch/arm/smpboot.c
+++ b/xen/arch/arm/smpboot.c
@@ -386,10 +386,6 @@  void __cpu_disable(void)
     /* It's now safe to remove this processor from the online map */
     cpumask_clear_cpu(cpu, &cpu_online_map);
 
-    if ( cpu_disable_scheduler(cpu) )
-        BUG();
-    smp_mb();
-
     /* Return to caller; eventually the IPI mechanism will unwind and the 
      * scheduler will drop to the idle loop, which will call stop_cpu(). */
 }
diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c
index 7d1226d7bc..b7a0a4a419 100644
--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -1221,9 +1221,6 @@  void __cpu_disable(void)
     cpumask_clear_cpu(cpu, &cpu_online_map);
     fixup_irqs(&cpu_online_map, 1);
     fixup_eoi();
-
-    if ( cpu_disable_scheduler(cpu) )
-        BUG();
 }
 
 void __cpu_die(unsigned int cpu)
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 60755a631e..5d2bbd5198 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -773,8 +773,9 @@  void restore_vcpu_affinity(struct domain *d)
 }
 
 /*
- * This function is used by cpu_hotplug code from stop_machine context
+ * This function is used by cpu_hotplug code via cpu notifier chain
  * and from cpupools to switch schedulers on a cpu.
+ * Caller must get domlist_read_lock.
  */
 int cpu_disable_scheduler(unsigned int cpu)
 {
@@ -789,12 +790,6 @@  int cpu_disable_scheduler(unsigned int cpu)
     if ( c == NULL )
         return ret;
 
-    /*
-     * We'd need the domain RCU lock, but:
-     *  - when we are called from cpupool code, it's acquired there already;
-     *  - when we are called for CPU teardown, we're in stop-machine context,
-     *    so that's not be a problem.
-     */
     for_each_domain_in_cpupool ( d, c )
     {
         for_each_vcpu ( d, v )
@@ -893,6 +888,30 @@  int cpu_disable_scheduler(unsigned int cpu)
     return ret;
 }
 
+static int cpu_disable_scheduler_check(unsigned int cpu)
+{
+    struct domain *d;
+    struct vcpu *v;
+    struct cpupool *c;
+
+    c = per_cpu(cpupool, cpu);
+    if ( c == NULL )
+        return 0;
+
+    for_each_domain_in_cpupool ( d, c )
+    {
+        for_each_vcpu ( d, v )
+        {
+            if ( v->affinity_broken )
+                return -EADDRINUSE;
+            if ( system_state != SYS_STATE_suspend && v->processor == cpu )
+                return -EAGAIN;
+        }
+    }
+
+    return 0;
+}
+
 /*
  * In general, this must be called with the scheduler lock held, because the
  * adjust_affinity hook may want to modify the vCPU state. However, when the
@@ -1737,7 +1756,16 @@  static int cpu_schedule_callback(
     case CPU_UP_PREPARE:
         rc = cpu_schedule_up(cpu);
         break;
+    case CPU_DOWN_PREPARE:
+        rcu_read_lock(&domlist_read_lock);
+        rc = cpu_disable_scheduler_check(cpu);
+        rcu_read_unlock(&domlist_read_lock);
+        break;
     case CPU_DEAD:
+        rcu_read_lock(&domlist_read_lock);
+        rc = cpu_disable_scheduler(cpu);
+        BUG_ON(rc);
+        rcu_read_unlock(&domlist_read_lock);
         SCHED_OP(sched, deinit_pdata, sd->sched_priv, cpu);
         /* Fallthrough */
     case CPU_UP_CANCELED: