
xen/sched: fix locking in sched_tick_[suspend|resume]()

Message ID: 20191004064010.25646-1-jgross@suse.com
State: New, archived
Series: xen/sched: fix locking in sched_tick_[suspend|resume]()

Commit Message

Juergen Gross Oct. 4, 2019, 6:40 a.m. UTC
sched_tick_suspend() and sched_tick_resume() should not call the
scheduler specific timer handlers in case the cpu they are running on
is just being moved to or from a cpupool.

Use a new percpu lock for that purpose.

Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
---
To be applied on top of my core scheduling series.
---
 xen/common/schedule.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)
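
For readers skimming the thread: the patch boils down to taking a new per-cpu
spinlock around the scheduler-specific tick hooks and around the cpupool
add/remove paths for that cpu. A condensed sketch (simplified from the full
diff at the end of this page; RCU locking and error handling omitted):

    /* Condensed from the patch below -- not the complete functions. */
    static DEFINE_PER_CPU(spinlock_t, sched_cpu_lock);

    void sched_tick_resume(void)
    {
        unsigned int cpu = smp_processor_id();
        spinlock_t *lock = &per_cpu(sched_cpu_lock, cpu);

        spin_lock(lock);
        /* ->scheduler cannot change underneath us while the lock is held. */
        sched_do_tick_resume(get_sched_res(cpu)->scheduler, cpu);
        spin_unlock(lock);
    }

    int schedule_cpu_rm(unsigned int cpu)
    {
        spinlock_t *cpu_lock = &per_cpu(sched_cpu_lock, cpu);

        spin_lock(cpu_lock);
        /* ... tear down the old scheduler's per-cpu data and switch the
         *     cpu back to the free-cpu scheduler ... */
        spin_unlock(cpu_lock);
        return 0;
    }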

Comments

Andrew Cooper Oct. 4, 2019, 7:50 a.m. UTC | #1
On 04/10/2019 07:40, Juergen Gross wrote:
> sched_tick_suspend() and sched_tick_resume() should not call the
> scheduler specific timer handlers in case the cpu they are running on
> is just being moved to or from a cpupool.
>
> Use a new percpu lock for that purpose.
>
> Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
> To be applied on top of my core scheduling series.

There is a somewhat interesting stack trace to go with this:

(XEN) Testing NMI watchdog on all CPUs: ok
(XEN) ----[ Xen-4.13.0-8.0.12-d  x86_64  debug=y   Not tainted ]----
(XEN) CPU:    79
(XEN) RIP:    e008:[<ffff82d0802467dc>] set_timer+0x39/0x1f7
(XEN) RFLAGS: 0000000000010002   CONTEXT: hypervisor
<snip>
(XEN) Xen call trace:
(XEN)    [<ffff82d0802467dc>] set_timer+0x39/0x1f7
(XEN)    [<ffff82d08022c1f4>] sched_credit.c#csched_tick_resume+0x54/0x59
(XEN)    [<ffff82d080241dfe>] sched_tick_resume+0x67/0x86
(XEN)    [<ffff82d0802eda52>] mwait-idle.c#mwait_idle+0x32b/0x357
(XEN)    [<ffff82d08027939e>] domain.c#idle_loop+0xa6/0xc2
(XEN)
(XEN) Running stub recovery selftests...
(XEN) Pagetable walk from 0000000000000048:
(XEN) traps.c:1564: GPF (0000): ffff82d0bffff041 [ffff82d0bffff041] ->
ffff82d0803893f2
(XEN)  L4[0x000] = 00000082cfb9c063 ffffffffffffffff
(XEN) traps.c:759: Trap 12: ffff82d0bffff040 [ffff82d0bffff040] ->
ffff82d0803893f2
(XEN)  L3[0x000] = 00000082cfb9b063 ffffffffffffffff
(XEN) traps.c:1098: Trap 3: ffff82d0bffff041 [ffff82d0bffff041] ->
ffff82d0803893f2
(XEN)  L2[0x000] = 00000082cfb9a063 ffffffffffffffff
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 79:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 0000000000000048
(XEN) ****************************************

which demonstrates CPU 79 exploding straight out of idle, while the BSP
is continuing to boot.

~Andrew
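
One way to read that trace (an assumption on my part; the trace itself only
shows the resume side): CPU 79 leaves idle and calls into the credit tick
handler through per-cpu scheduler data that the still-booting BSP is
concurrently replacing while assigning the cpu to a cpupool, so set_timer()
is handed stale data (the faulting address 0x48 looks like a small offset
from a NULL pointer):

    /*
     * Hypothetical interleaving, not reconstructed from the report:
     *
     *   BSP (cpupool setup)                 CPU 79 (idle loop)
     *   -------------------                 ------------------
     *   schedule_cpu_add(79, pool0)
     *     replaces CPU 79's per-cpu         mwait_idle()
     *     scheduler / timer state             sched_tick_resume()
     *                                           get_sched_res(79)->scheduler
     *                                           csched_tick_resume()
     *                                             set_timer(<stale/NULL>, ...)
     *                                             -> page fault at 0x48
     */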
George Dunlap Oct. 4, 2019, 2:08 p.m. UTC | #2
On 10/4/19 7:40 AM, Juergen Gross wrote:
> sched_tick_suspend() and sched_tick_resume() should not call the
> scheduler specific timer handlers in case the cpu they are running on
> is just being moved to or from a cpupool.
> 
> Use a new percpu lock for that purpose.

Is there a reason we can't use the pcpu_schedule_lock() instead of
introducing a new one?  Sorry if this is obvious, but it's been a while
since I poked around this code.

 -George
Juergen Gross Oct. 4, 2019, 2:24 p.m. UTC | #3
On 04.10.19 16:08, George Dunlap wrote:
> On 10/4/19 7:40 AM, Juergen Gross wrote:
>> sched_tick_suspend() and sched_tick_resume() should not call the
>> scheduler specific timer handlers in case the cpu they are running on
>> is just being moved to or from a cpupool.
>>
>> Use a new percpu lock for that purpose.
> 
> Is there a reason we can't use the pcpu_schedule_lock() instead of
> introducing a new one?  Sorry if this is obvious, but it's been a while
> since I poked around this code.

Lock contention would be higher especially with credit2 which is using a
per-core or even per-socket lock. We don't care about other scheduling
activity here, all we need is a guard against our per-cpu scheduler
data being changed beneath our feet.


Juergen
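
To illustrate the contrast being drawn here (a rough sketch; the field names
are approximations, not the exact Xen definitions): the lock behind
pcpu_schedule_lock() is whatever lock the active scheduler has installed for
the cpu, which credit2 points at a runqueue lock shared by a whole core or
socket, while the lock added by this patch is touched only by this cpu's
tick path and by cpupool add/remove for this very cpu:

    /* Sketch only -- names approximate. */

    /* pcpu_schedule_lock(cpu) takes an indirect, scheduler-owned lock;
     * with credit2 several cpus share one such runqueue lock. */
    spinlock_t *sched_lock = get_sched_res(cpu)->schedule_lock;

    /* The patch instead takes a lock that is strictly per-cpu and only
     * ever contended by cpupool add/remove operations for this cpu. */
    spinlock_t *tick_lock = &per_cpu(sched_cpu_lock, cpu);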
Dario Faggioli Oct. 4, 2019, 2:29 p.m. UTC | #4
On Fri, 2019-10-04 at 08:50 +0100, Andrew Cooper wrote:
> On 04/10/2019 07:40, Juergen Gross wrote:
> > sched_tick_suspend() and sched_tick_resume() should not call the
> > scheduler specific timer handlers in case the cpu they are running
> > on
> > is just being moved to or from a cpupool.
> > 
> > Use a new percpu lock for that purpose.
> > 
> > Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com>
> > Signed-off-by: Juergen Gross <jgross@suse.com>
> > ---
> > To be applied on top of my core scheduling series.
> 
> There is a somewhat interesting stack trace to go with this
> 
Sorry, I don't think I get this. Does it mean that you see the
stacktrace below _even_with_ the patch applied?

> (XEN) Testing NMI watchdog on all CPUs: ok
> (XEN) ----[ Xen-4.13.0-8.0.12-d  x86_64  debug=y   Not tainted ]----
> (XEN) CPU:    79
> (XEN) RIP:    e008:[<ffff82d0802467dc>] set_timer+0x39/0x1f7
> (XEN) RFLAGS: 0000000000010002   CONTEXT: hypervisor
> <snip>
> (XEN) Xen call trace:
> (XEN)    [<ffff82d0802467dc>] set_timer+0x39/0x1f7
> (XEN)    [<ffff82d08022c1f4>]
> sched_credit.c#csched_tick_resume+0x54/0x59
> (XEN)    [<ffff82d080241dfe>] sched_tick_resume+0x67/0x86
> (XEN)    [<ffff82d0802eda52>] mwait-idle.c#mwait_idle+0x32b/0x357
> (XEN)    [<ffff82d08027939e>] domain.c#idle_loop+0xa6/0xc2
> (XEN)
> (XEN) Running stub recovery selftests...
> (XEN) Pagetable walk from 0000000000000048:
> (XEN) traps.c:1564: GPF (0000): ffff82d0bffff041 [ffff82d0bffff041]
> ->
> ffff82d0803893f2
> (XEN)  L4[0x000] = 00000082cfb9c063 ffffffffffffffff
> (XEN) traps.c:759: Trap 12: ffff82d0bffff040 [ffff82d0bffff040] ->
> ffff82d0803893f2
> (XEN)  L3[0x000] = 00000082cfb9b063 ffffffffffffffff
> (XEN) traps.c:1098: Trap 3: ffff82d0bffff041 [ffff82d0bffff041] ->
> ffff82d0803893f2
> (XEN)  L2[0x000] = 00000082cfb9a063 ffffffffffffffff
> (XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 79:
> (XEN) FATAL PAGE FAULT
> (XEN) [error_code=0000]
> (XEN) Faulting linear address: 0000000000000048
> (XEN) ****************************************
> 
> which demonstrates CPU 79 exploding straight out of idle, while the
> BSP
> is continuing to boot.
George Dunlap Oct. 4, 2019, 2:34 p.m. UTC | #5
On 10/4/19 3:24 PM, Jürgen Groß wrote:
> On 04.10.19 16:08, George Dunlap wrote:
>> On 10/4/19 7:40 AM, Juergen Gross wrote:
>>> sched_tick_suspend() and sched_tick_resume() should not call the
>>> scheduler specific timer handlers in case the cpu they are running on
>>> is just being moved to or from a cpupool.
>>>
>>> Use a new percpu lock for that purpose.
>>
>> Is there a reason we can't use the pcpu_schedule_lock() instead of
>> introducing a new one?  Sorry if this is obvious, but it's been a while
>> since I poked around this code.
> 
> Lock contention would be higher especially with credit2 which is using a
> per-core or even per-socket lock. We don't care about other scheduling
> activity here, all we need is a guard against our per-cpu scheduler
> data being changed beneath our feet.

Is this code really being called so often that we need to worry about
this level of contention?

We already have a *lot* of locks; and in this case you're adding a
second lock which interacts with the per-scheduler cpu lock.  This just
seems like asking for trouble.

I won't Nack the patch, but I don't think I would ack it without clear
evidence that the extra lock has a performance improvement that's worth
the cost of the extra complexity.

 -George
Juergen Gross Oct. 4, 2019, 2:43 p.m. UTC | #6
On 04.10.19 16:34, George Dunlap wrote:
> On 10/4/19 3:24 PM, Jürgen Groß wrote:
>> On 04.10.19 16:08, George Dunlap wrote:
>>> On 10/4/19 7:40 AM, Juergen Gross wrote:
>>>> sched_tick_suspend() and sched_tick_resume() should not call the
>>>> scheduler specific timer handlers in case the cpu they are running on
>>>> is just being moved to or from a cpupool.
>>>>
>>>> Use a new percpu lock for that purpose.
>>>
>>> Is there a reason we can't use the pcpu_schedule_lock() instead of
>>> introducing a new one?  Sorry if this is obvious, but it's been a while
>>> since I poked around this code.
>>
>> Lock contention would be higher especially with credit2 which is using a
>> per-core or even per-socket lock. We don't care about other scheduling
>> activity here, all we need is a guard against our per-cpu scheduler
>> data being changed beneath our feet.
> 
> Is this code really being called so often that we need to worry about
> this level of contention?

It's called each time idle is entered and left again.

Especially with core scheduling there is a high probability of multiple
cpus leaving idle at the same time and the per-scheduler lock being used
in parallel already.

> We already have a *lot* of locks; and in this case you're adding a
> second lock which interacts with the per-scheduler cpu lock.  This just
> seems like asking for trouble.

In which way does it interact with the per-scheduler cpu lock?

> I won't Nack the patch, but I don't think I would ack it without clear
> evidence that the extra lock has a performance improvement that's worth
> the cost of the extra complexity.

I think complexity is lower this way, especially considering that the
per-scheduler lock changes when a cpu is moved to or from a cpupool.


Juergen
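
For context on how hot this path is: the stack trace earlier in the thread
shows it being reached from the mwait idle path, i.e. something along these
lines on every idle entry and exit of every cpu (heavily abridged; not the
verbatim mwait-idle.c code):

    static void mwait_idle(void)
    {
        /* ... */
        sched_tick_suspend();          /* cpu is about to go idle */
        /* ... mwait_idle_with_hints(...) -- wait for a wakeup ... */
        sched_tick_resume();           /* cpu is leaving idle     */
        /* ... */
    }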
Andrew Cooper Oct. 4, 2019, 2:49 p.m. UTC | #7
On 04/10/2019 15:29, Dario Faggioli wrote:
> On Fri, 2019-10-04 at 08:50 +0100, Andrew Cooper wrote:
>> On 04/10/2019 07:40, Juergen Gross wrote:
>>> sched_tick_suspend() and sched_tick_resume() should not call the
>>> scheduler specific timer handlers in case the cpu they are running
>>> on
>>> is just being moved to or from a cpupool.
>>>
>>> Use a new percpu lock for that purpose.
>>>
>>> Reported-by: Sergey Dyasli <sergey.dyasli@citrix.com>
>>> Signed-off-by: Juergen Gross <jgross@suse.com>
>>> ---
>>> To be applied on top of my core scheduling series.
>> There is a somewhat interesting stack trace to go with this
>>
> Sorry, I don't think I get this. Does it mean that you see the
> stacktrace below _even_with_ the patch applied?

I wasn't very clear.  This was the stack trace which Igor found during
testing, which led to us reporting the bug which is fixed by this patch.

It might be interesting to include the stack trace in the commit message.

~Andrew
George Dunlap Oct. 4, 2019, 2:56 p.m. UTC | #8
On 10/4/19 3:43 PM, Jürgen Groß wrote:
> On 04.10.19 16:34, George Dunlap wrote:
>> On 10/4/19 3:24 PM, Jürgen Groß wrote:
>>> On 04.10.19 16:08, George Dunlap wrote:
>>>> On 10/4/19 7:40 AM, Juergen Gross wrote:
>>>>> sched_tick_suspend() and sched_tick_resume() should not call the
>>>>> scheduler specific timer handlers in case the cpu they are running on
>>>>> is just being moved to or from a cpupool.
>>>>>
>>>>> Use a new percpu lock for that purpose.
>>>>
>>>> Is there a reason we can't use the pcpu_schedule_lock() instead of
>>>> introducing a new one?  Sorry if this is obvious, but it's been a while
>>>> since I poked around this code.
>>>
>>> Lock contention would be higher especially with credit2 which is using a
>>> per-core or even per-socket lock. We don't care about other scheduling
>>> activity here, all we need is a guard against our per-cpu scheduler
>>> data being changed beneath our feet.
>>
>> Is this code really being called so often that we need to worry about
>> this level of contention?
> 
> Its called each time idle is entered and left again.
> 
> Especially with core scheduling there is a high probability of multiple
> cpus leaving idle at the same time and the per-scheduler lock being used
> in parallel already.

Hrm, that does sound pretty bad.

>> We already have a *lot* of locks; and in this case you're adding a
>> second lock which interacts with the per-scheduler cpu lock.  This just
>> seems like asking for trouble.
> 
> In which way does it interact with the per-scheduler cpu lock?
> 
>> I won't Nack the patch, but I don't think I would ack it without clear
>> evidence that the extra lock has a performance improvement that's worth
>> the cost of the extra complexity.
> 
> I think complexity is lower this way. Especially considering the per-
> scheduler lock changing with moving a cpu to or from a cpupool.

The key aspect of the per-scheduler lock is that once you hold it, the
pointer to the lock can't change.

After this patch, the fact remains that sometimes you need to grab one
lock, sometimes the other, and sometimes both.

And, tick_suspend() lives in the per-scheduler code.  Each scheduler has
to remember that tick_suspend and tick_resume hold a completely
different lock to the rest of the scheduling functions.

 -George
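
For readers not deep in this code: the property George describes comes from
the usual lock-the-pointer pattern, roughly like the sketch below (simplified;
not the exact pcpu_schedule_lock() implementation):

    static spinlock_t *pcpu_schedule_lock_sketch(unsigned int cpu)
    {
        spinlock_t *lock;

        for ( ; ; )
        {
            lock = get_sched_res(cpu)->schedule_lock;
            spin_lock(lock);
            /* The scheduler may have repointed the lock while we were
             * spinning; only when the pointer still matches is it
             * guaranteed stable for as long as we hold the lock. */
            if ( lock == get_sched_res(cpu)->schedule_lock )
                return lock;
            spin_unlock(lock);
        }
    }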
Juergen Gross Oct. 4, 2019, 3:03 p.m. UTC | #9
On 04.10.19 16:56, George Dunlap wrote:
> On 10/4/19 3:43 PM, Jürgen Groß wrote:
>> On 04.10.19 16:34, George Dunlap wrote:
>>> On 10/4/19 3:24 PM, Jürgen Groß wrote:
>>>> On 04.10.19 16:08, George Dunlap wrote:
>>>>> On 10/4/19 7:40 AM, Juergen Gross wrote:
>>>>>> sched_tick_suspend() and sched_tick_resume() should not call the
>>>>>> scheduler specific timer handlers in case the cpu they are running on
>>>>>> is just being moved to or from a cpupool.
>>>>>>
>>>>>> Use a new percpu lock for that purpose.
>>>>>
>>>>> Is there a reason we can't use the pcpu_schedule_lock() instead of
>>>>> introducing a new one?  Sorry if this is obvious, but it's been a while
>>>>> since I poked around this code.
>>>>
>>>> Lock contention would be higher especially with credit2 which is using a
>>>> per-core or even per-socket lock. We don't care about other scheduling
>>>> activity here, all we need is a guard against our per-cpu scheduler
>>>> data being changed beneath our feet.
>>>
>>> Is this code really being called so often that we need to worry about
>>> this level of contention?
>>
>> Its called each time idle is entered and left again.
>>
>> Especially with core scheduling there is a high probability of multiple
>> cpus leaving idle at the same time and the per-scheduler lock being used
>> in parallel already.
> 
> Hrm, that does sound pretty bad.
> 
>>> We already have a *lot* of locks; and in this case you're adding a
>>> second lock which interacts with the per-scheduler cpu lock.  This just
>>> seems like asking for trouble.
>>
>> In which way does it interact with the per-scheduler cpu lock?
>>
>>> I won't Nack the patch, but I don't think I would ack it without clear
>>> evidence that the extra lock has a performance improvement that's worth
>>> the cost of the extra complexity.
>>
>> I think complexity is lower this way. Especially considering the per-
>> scheduler lock changing with moving a cpu to or from a cpupool.
> 
> The key aspect of the per-scheduler lock is that once you hold it, the
> pointer to the lock can't change.
> 
> After this patch, the fact remains that sometimes you need to grab one
> lock, sometimes the other, and sometimes both.
> 
> And, tick_suspend() lives in the per-scheduler code.  Each scheduler has
> to remember that tick_suspend and tick_resume hold a completely
> different lock to the rest of the scheduling functions.

Is that really so critical? Today only credit1 has tick_suspend and
tick_resume hooks, and both are really very simple. I can add a
comment in sched-if.h if you like.

And up to now there was no lock at all involved when calling them...

If you think using the normal scheduler lock is to be preferred I'd
be happy to change the patch. But I should mention I was already
planning to revisit usage of the scheduler lock and replace it with the
new per-cpu lock where appropriate (not sure I'd find any appropriate
path for replacement).


Juergen
George Dunlap Oct. 4, 2019, 3:37 p.m. UTC | #10
On 10/4/19 4:03 PM, Jürgen Groß wrote:
> On 04.10.19 16:56, George Dunlap wrote:
>> On 10/4/19 3:43 PM, Jürgen Groß wrote:
>>> On 04.10.19 16:34, George Dunlap wrote:
>>>> On 10/4/19 3:24 PM, Jürgen Groß wrote:
>>>>> On 04.10.19 16:08, George Dunlap wrote:
>>>>>> On 10/4/19 7:40 AM, Juergen Gross wrote:
>>>>>>> sched_tick_suspend() and sched_tick_resume() should not call the
>>>>>>> scheduler specific timer handlers in case the cpu they are
>>>>>>> running on
>>>>>>> is just being moved to or from a cpupool.
>>>>>>>
>>>>>>> Use a new percpu lock for that purpose.
>>>>>>
>>>>>> Is there a reason we can't use the pcpu_schedule_lock() instead of
>>>>>> introducing a new one?  Sorry if this is obvious, but it's been a
>>>>>> while
>>>>>> since I poked around this code.
>>>>>
>>>>> Lock contention would be higher especially with credit2 which is
>>>>> using a
>>>>> per-core or even per-socket lock. We don't care about other scheduling
>>>>> activity here, all we need is a guard against our per-cpu scheduler
>>>>> data being changed beneath our feet.
>>>>
>>>> Is this code really being called so often that we need to worry about
>>>> this level of contention?
>>>
>>> Its called each time idle is entered and left again.
>>>
>>> Especially with core scheduling there is a high probability of multiple
>>> cpus leaving idle at the same time and the per-scheduler lock being used
>>> in parallel already.
>>
>> Hrm, that does sound pretty bad.
>>
>>>> We already have a *lot* of locks; and in this case you're adding a
>>>> second lock which interacts with the per-scheduler cpu lock.  This just
>>>> seems like asking for trouble.
>>>
>>> In which way does it interact with the per-scheduler cpu lock?
>>>
>>>> I won't Nack the patch, but I don't think I would ack it without clear
>>>> evidence that the extra lock has a performance improvement that's worth
>>>> the cost of the extra complexity.
>>>
>>> I think complexity is lower this way. Especially considering the per-
>>> scheduler lock changing with moving a cpu to or from a cpupool.
>>
>> The key aspect of the per-scheduler lock is that once you hold it, the
>> pointer to the lock can't change.
>>
>> After this patch, the fact remains that sometimes you need to grab one
>> lock, sometimes the other, and sometimes both.
>>
>> And, tick_suspend() lives in the per-scheduler code.  Each scheduler has
>> to remember that tick_suspend and tick_resume hold a completely
>> different lock to the rest of the scheduling functions.
> 
> Is that really so critical? Today only credit1 has tick_suspend and
> tick_resume hooks, and both are really very simple. I can add a
> comment in sched-if.h if you like.
> 
> And up to now there was no lock at all involved when calling them...
> 
> If you think using the normal scheduler lock is to be preferred I'd
> be happy to change the patch. But I should mention I was already
> planning to revisit usage of the scheduler lock and replace it by the
> new per-cpu lock where appropriate (not sure I'd find any appropriate
> path for replacement).

Well the really annoying thing here is that all the other schedulers --
in particular, credit2, which as you say, is designed to have multiple
runqueues share the same lock -- have to grab & release the lock just to
find out that there's nothing to do.

And even credit1 doesn't do anything particularly clever -- all it does
is stop and start a timer based on a scheduler-global configuration. And
the scheduling lock is grabbed to switch to idle anyway.  It seems like
we should be able to do something more sensible.

/me is still thinking

 -George
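
For reference, the credit1 hooks being discussed are roughly of this shape
(paraphrased from memory, so treat field names and details as approximate
rather than the verbatim sched_credit.c source):

    static void csched_tick_suspend(const struct scheduler *ops, unsigned int cpu)
    {
        struct csched_pcpu *spc = CSCHED_PCPU(cpu);

        stop_timer(&spc->ticker);
    }

    static void csched_tick_resume(const struct scheduler *ops, unsigned int cpu)
    {
        struct csched_private *prv = CSCHED_PRIV(ops);
        struct csched_pcpu *spc = CSCHED_PCPU(cpu);
        uint64_t now = NOW();

        /* Re-arm the per-cpu accounting tick on the next tick boundary. */
        set_timer(&spc->ticker,
                  now + MICROSECS(prv->tick_period_us)
                      - now % MICROSECS(prv->tick_period_us));
    }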
Juergen Gross Oct. 4, 2019, 3:40 p.m. UTC | #11
On 04.10.19 17:37, George Dunlap wrote:
> On 10/4/19 4:03 PM, Jürgen Groß wrote:
>> On 04.10.19 16:56, George Dunlap wrote:
>>> On 10/4/19 3:43 PM, Jürgen Groß wrote:
>>>> On 04.10.19 16:34, George Dunlap wrote:
>>>>> On 10/4/19 3:24 PM, Jürgen Groß wrote:
>>>>>> On 04.10.19 16:08, George Dunlap wrote:
>>>>>>> On 10/4/19 7:40 AM, Juergen Gross wrote:
>>>>>>>> sched_tick_suspend() and sched_tick_resume() should not call the
>>>>>>>> scheduler specific timer handlers in case the cpu they are
>>>>>>>> running on
>>>>>>>> is just being moved to or from a cpupool.
>>>>>>>>
>>>>>>>> Use a new percpu lock for that purpose.
>>>>>>>
>>>>>>> Is there a reason we can't use the pcpu_schedule_lock() instead of
>>>>>>> introducing a new one?  Sorry if this is obvious, but it's been a
>>>>>>> while
>>>>>>> since I poked around this code.
>>>>>>
>>>>>> Lock contention would be higher especially with credit2 which is
>>>>>> using a
>>>>>> per-core or even per-socket lock. We don't care about other scheduling
>>>>>> activity here, all we need is a guard against our per-cpu scheduler
>>>>>> data being changed beneath our feet.
>>>>>
>>>>> Is this code really being called so often that we need to worry about
>>>>> this level of contention?
>>>>
>>>> Its called each time idle is entered and left again.
>>>>
>>>> Especially with core scheduling there is a high probability of multiple
>>>> cpus leaving idle at the same time and the per-scheduler lock being used
>>>> in parallel already.
>>>
>>> Hrm, that does sound pretty bad.
>>>
>>>>> We already have a *lot* of locks; and in this case you're adding a
>>>>> second lock which interacts with the per-scheduler cpu lock.  This just
>>>>> seems like asking for trouble.
>>>>
>>>> In which way does it interact with the per-scheduler cpu lock?
>>>>
>>>>> I won't Nack the patch, but I don't think I would ack it without clear
>>>>> evidence that the extra lock has a performance improvement that's worth
>>>>> the cost of the extra complexity.
>>>>
>>>> I think complexity is lower this way. Especially considering the per-
>>>> scheduler lock changing with moving a cpu to or from a cpupool.
>>>
>>> The key aspect of the per-scheduler lock is that once you hold it, the
>>> pointer to the lock can't change.
>>>
>>> After this patch, the fact remains that sometimes you need to grab one
>>> lock, sometimes the other, and sometimes both.
>>>
>>> And, tick_suspend() lives in the per-scheduler code.  Each scheduler has
>>> to remember that tick_suspend and tick_resume hold a completely
>>> different lock to the rest of the scheduling functions.
>>
>> Is that really so critical? Today only credit1 has tick_suspend and
>> tick_resume hooks, and both are really very simple. I can add a
>> comment in sched-if.h if you like.
>>
>> And up to now there was no lock at all involved when calling them...
>>
>> If you think using the normal scheduler lock is to be preferred I'd
>> be happy to change the patch. But I should mention I was already
>> planning to revisit usage of the scheduler lock and replace it by the
>> new per-cpu lock where appropriate (not sure I'd find any appropriate
>> path for replacement).
> 
> Well the really annoying thing here is that all the other schedulers --
> in particular, credit2, which as you say, is designed to have multiple
> runqueues share the same lock -- have to grab & release the lock just to
> find out that there's nothing to do.
> 
> And even credit1 doesn't do anything particularly clever -- all it does
> is stop and start a timer based on a scheduler-global configuration. And
> the scheduling lock is grabbed to switch to idle anyway.  It seems like
> we should be able to do something more sensible.

Yeah, I thought the same.

> 
> /me is still thinking

I think we shouldn't rush that in.


Juergen
George Dunlap Oct. 4, 2019, 4:09 p.m. UTC | #12
On 10/4/19 4:40 PM, Jürgen Groß wrote:
> On 04.10.19 17:37, George Dunlap wrote:
>> On 10/4/19 4:03 PM, Jürgen Groß wrote:
>>> On 04.10.19 16:56, George Dunlap wrote:
>>>> On 10/4/19 3:43 PM, Jürgen Groß wrote:
>>>>> On 04.10.19 16:34, George Dunlap wrote:
>>>>>> On 10/4/19 3:24 PM, Jürgen Groß wrote:
>>>>>>> On 04.10.19 16:08, George Dunlap wrote:
>>>>>>>> On 10/4/19 7:40 AM, Juergen Gross wrote:
>>>>>>>>> sched_tick_suspend() and sched_tick_resume() should not call the
>>>>>>>>> scheduler specific timer handlers in case the cpu they are
>>>>>>>>> running on
>>>>>>>>> is just being moved to or from a cpupool.
>>>>>>>>>
>>>>>>>>> Use a new percpu lock for that purpose.
>>>>>>>>
>>>>>>>> Is there a reason we can't use the pcpu_schedule_lock() instead of
>>>>>>>> introducing a new one?  Sorry if this is obvious, but it's been a
>>>>>>>> while
>>>>>>>> since I poked around this code.
>>>>>>>
>>>>>>> Lock contention would be higher especially with credit2 which is
>>>>>>> using a
>>>>>>> per-core or even per-socket lock. We don't care about other
>>>>>>> scheduling
>>>>>>> activity here, all we need is a guard against our per-cpu scheduler
>>>>>>> data being changed beneath our feet.
>>>>>>
>>>>>> Is this code really being called so often that we need to worry about
>>>>>> this level of contention?
>>>>>
>>>>> Its called each time idle is entered and left again.
>>>>>
>>>>> Especially with core scheduling there is a high probability of
>>>>> multiple
>>>>> cpus leaving idle at the same time and the per-scheduler lock being
>>>>> used
>>>>> in parallel already.
>>>>
>>>> Hrm, that does sound pretty bad.
>>>>
>>>>>> We already have a *lot* of locks; and in this case you're adding a
>>>>>> second lock which interacts with the per-scheduler cpu lock.  This
>>>>>> just
>>>>>> seems like asking for trouble.
>>>>>
>>>>> In which way does it interact with the per-scheduler cpu lock?
>>>>>
>>>>>> I won't Nack the patch, but I don't think I would ack it without
>>>>>> clear
>>>>>> evidence that the extra lock has a performance improvement that's
>>>>>> worth
>>>>>> the cost of the extra complexity.
>>>>>
>>>>> I think complexity is lower this way. Especially considering the per-
>>>>> scheduler lock changing with moving a cpu to or from a cpupool.
>>>>
>>>> The key aspect of the per-scheduler lock is that once you hold it, the
>>>> pointer to the lock can't change.
>>>>
>>>> After this patch, the fact remains that sometimes you need to grab one
>>>> lock, sometimes the other, and sometimes both.
>>>>
>>>> And, tick_suspend() lives in the per-scheduler code.  Each scheduler
>>>> has
>>>> to remember that tick_suspend and tick_resume hold a completely
>>>> different lock to the rest of the scheduling functions.
>>>
>>> Is that really so critical? Today only credit1 has tick_suspend and
>>> tick_resume hooks, and both are really very simple. I can add a
>>> comment in sched-if.h if you like.
>>>
>>> And up to now there was no lock at all involved when calling them...
>>>
>>> If you think using the normal scheduler lock is to be preferred I'd
>>> be happy to change the patch. But I should mention I was already
>>> planning to revisit usage of the scheduler lock and replace it by the
>>> new per-cpu lock where appropriate (not sure I'd find any appropriate
>>> path for replacement).
>>
>> Well the really annoying thing here is that all the other schedulers --
>> in particular, credit2, which as you say, is designed to have multiple
>> runqueues share the same lock -- have to grab & release the lock just to
>> find out that there's nothing to do.
>>
>> And even credit1 doesn't do anything particularly clever -- all it does
>> is stop and start a timer based on a scheduler-global configuration. And
>> the scheduling lock is grabbed to switch to idle anyway.  It seems like
>> we should be able to do something more sensible.
> 
> Yeah, I thought the same.

I can think of a couple of options:

1. Have schedule.c call s->tick_* when switching to / from idle

2. Get rid of s->tick_*, and have sched_credit.c suspend / resume ticks
when switching to / from idle in csched_schedule()

3. Have schedule.c suspend / resume ticks, and have an interface that
allows schedulers to enable / disable them.

4. Rework sched_credit to be tickless.

 -George
Juergen Gross Oct. 6, 2019, 6:05 p.m. UTC | #13
On 04.10.19 18:09, George Dunlap wrote:
> On 10/4/19 4:40 PM, Jürgen Groß wrote:
>> On 04.10.19 17:37, George Dunlap wrote:
>>> On 10/4/19 4:03 PM, Jürgen Groß wrote:
>>>> On 04.10.19 16:56, George Dunlap wrote:
>>>>> On 10/4/19 3:43 PM, Jürgen Groß wrote:
>>>>>> On 04.10.19 16:34, George Dunlap wrote:
>>>>>>> On 10/4/19 3:24 PM, Jürgen Groß wrote:
>>>>>>>> On 04.10.19 16:08, George Dunlap wrote:
>>>>>>>>> On 10/4/19 7:40 AM, Juergen Gross wrote:
>>>>>>>>>> sched_tick_suspend() and sched_tick_resume() should not call the
>>>>>>>>>> scheduler specific timer handlers in case the cpu they are
>>>>>>>>>> running on
>>>>>>>>>> is just being moved to or from a cpupool.
>>>>>>>>>>
>>>>>>>>>> Use a new percpu lock for that purpose.
>>>>>>>>>
>>>>>>>>> Is there a reason we can't use the pcpu_schedule_lock() instead of
>>>>>>>>> introducing a new one?  Sorry if this is obvious, but it's been a
>>>>>>>>> while
>>>>>>>>> since I poked around this code.
>>>>>>>>
>>>>>>>> Lock contention would be higher especially with credit2 which is
>>>>>>>> using a
>>>>>>>> per-core or even per-socket lock. We don't care about other
>>>>>>>> scheduling
>>>>>>>> activity here, all we need is a guard against our per-cpu scheduler
>>>>>>>> data being changed beneath our feet.
>>>>>>>
>>>>>>> Is this code really being called so often that we need to worry about
>>>>>>> this level of contention?
>>>>>>
>>>>>> Its called each time idle is entered and left again.
>>>>>>
>>>>>> Especially with core scheduling there is a high probability of
>>>>>> multiple
>>>>>> cpus leaving idle at the same time and the per-scheduler lock being
>>>>>> used
>>>>>> in parallel already.
>>>>>
>>>>> Hrm, that does sound pretty bad.
>>>>>
>>>>>>> We already have a *lot* of locks; and in this case you're adding a
>>>>>>> second lock which interacts with the per-scheduler cpu lock.  This
>>>>>>> just
>>>>>>> seems like asking for trouble.
>>>>>>
>>>>>> In which way does it interact with the per-scheduler cpu lock?
>>>>>>
>>>>>>> I won't Nack the patch, but I don't think I would ack it without
>>>>>>> clear
>>>>>>> evidence that the extra lock has a performance improvement that's
>>>>>>> worth
>>>>>>> the cost of the extra complexity.
>>>>>>
>>>>>> I think complexity is lower this way. Especially considering the per-
>>>>>> scheduler lock changing with moving a cpu to or from a cpupool.
>>>>>
>>>>> The key aspect of the per-scheduler lock is that once you hold it, the
>>>>> pointer to the lock can't change.
>>>>>
>>>>> After this patch, the fact remains that sometimes you need to grab one
>>>>> lock, sometimes the other, and sometimes both.
>>>>>
>>>>> And, tick_suspend() lives in the per-scheduler code.  Each scheduler
>>>>> has
>>>>> to remember that tick_suspend and tick_resume hold a completely
>>>>> different lock to the rest of the scheduling functions.
>>>>
>>>> Is that really so critical? Today only credit1 has tick_suspend and
>>>> tick_resume hooks, and both are really very simple. I can add a
>>>> comment in sched-if.h if you like.
>>>>
>>>> And up to now there was no lock at all involved when calling them...
>>>>
>>>> If you think using the normal scheduler lock is to be preferred I'd
>>>> be happy to change the patch. But I should mention I was already
>>>> planning to revisit usage of the scheduler lock and replace it by the
>>>> new per-cpu lock where appropriate (not sure I'd find any appropriate
>>>> path for replacement).
>>>
>>> Well the really annoying thing here is that all the other schedulers --
>>> in particular, credit2, which as you say, is designed to have multiple
>>> runqueues share the same lock -- have to grab & release the lock just to
>>> find out that there's nothing to do.
>>>
>>> And even credit1 doesn't do anything particularly clever -- all it does
>>> is stop and start a timer based on a scheduler-global configuration. And
>>> the scheduling lock is grabbed to switch to idle anyway.  It seems like
>>> we should be able to do something more sensible.
>>
>> Yeah, I thought the same.
> 
> I can think of a couple of options:
> 
> 1. Have schedule.c call s->tick_* when switching to / from idle
> 
> 2. Get rid of s->tick_*, and have sched_credit.c suspend / resume ticks
> when switching to / from idle in csched_schedule()
> 
> 3. Have schedule.c suspend / resume ticks, and have an interface that
> allows schedulers to enable / disable them.
> 
> 4. Rework sched_credit to be tickless.

I'm going with 2., as it will have multiple advantages:

- not very intrusive
- keeps credit specifics in credit
- allows us to fix an existing bug in credit: if we have only one domain
   running in a cpupool with the credit scheduler and this domain is capped,
   today's handling will disable the credit tick in idle and there will be
   nobody left to unpause the vcpus of the capped domain.

Will send patches soon.


Juergen
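
A rough idea of what option 2 could look like (purely illustrative; the
helper name and where exactly it hooks into csched_schedule() are
assumptions, not the follow-up patches being announced here): credit1
already knows inside csched_schedule() whether it is switching to or away
from the idle unit, so it can stop/start its own tick there and the generic
tick_suspend/tick_resume hooks can be dropped:

    /* Illustrative fragment; csched_tick_start() is a hypothetical helper. */
    static void csched_switch_tick(struct csched_pcpu *spc,
                                   const struct sched_unit *prev,
                                   const struct sched_unit *next, s_time_t now)
    {
        if ( is_idle_unit(next) )
            /* Going idle: the accounting tick is not needed. */
            stop_timer(&spc->ticker);
        else if ( is_idle_unit(prev) )
            /* Leaving idle: restart the per-cpu accounting tick. */
            csched_tick_start(spc, now);
    }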
Juergen Gross Oct. 7, 2019, 6:09 a.m. UTC | #14
On 06.10.19 20:05, Jürgen Groß wrote:
> On 04.10.19 18:09, George Dunlap wrote:
>> On 10/4/19 4:40 PM, Jürgen Groß wrote:
>>> On 04.10.19 17:37, George Dunlap wrote:
>>>> On 10/4/19 4:03 PM, Jürgen Groß wrote:
>>>>> On 04.10.19 16:56, George Dunlap wrote:
>>>>>> On 10/4/19 3:43 PM, Jürgen Groß wrote:
>>>>>>> On 04.10.19 16:34, George Dunlap wrote:
>>>>>>>> On 10/4/19 3:24 PM, Jürgen Groß wrote:
>>>>>>>>> On 04.10.19 16:08, George Dunlap wrote:
>>>>>>>>>> On 10/4/19 7:40 AM, Juergen Gross wrote:
>>>>>>>>>>> sched_tick_suspend() and sched_tick_resume() should not call the
>>>>>>>>>>> scheduler specific timer handlers in case the cpu they are
>>>>>>>>>>> running on
>>>>>>>>>>> is just being moved to or from a cpupool.
>>>>>>>>>>>
>>>>>>>>>>> Use a new percpu lock for that purpose.
>>>>>>>>>>
>>>>>>>>>> Is there a reason we can't use the pcpu_schedule_lock() 
>>>>>>>>>> instead of
>>>>>>>>>> introducing a new one?  Sorry if this is obvious, but it's been a
>>>>>>>>>> while
>>>>>>>>>> since I poked around this code.
>>>>>>>>>
>>>>>>>>> Lock contention would be higher especially with credit2 which is
>>>>>>>>> using a
>>>>>>>>> per-core or even per-socket lock. We don't care about other
>>>>>>>>> scheduling
>>>>>>>>> activity here, all we need is a guard against our per-cpu 
>>>>>>>>> scheduler
>>>>>>>>> data being changed beneath our feet.
>>>>>>>>
>>>>>>>> Is this code really being called so often that we need to worry 
>>>>>>>> about
>>>>>>>> this level of contention?
>>>>>>>
>>>>>>> Its called each time idle is entered and left again.
>>>>>>>
>>>>>>> Especially with core scheduling there is a high probability of
>>>>>>> multiple
>>>>>>> cpus leaving idle at the same time and the per-scheduler lock being
>>>>>>> used
>>>>>>> in parallel already.
>>>>>>
>>>>>> Hrm, that does sound pretty bad.
>>>>>>
>>>>>>>> We already have a *lot* of locks; and in this case you're adding a
>>>>>>>> second lock which interacts with the per-scheduler cpu lock.  This
>>>>>>>> just
>>>>>>>> seems like asking for trouble.
>>>>>>>
>>>>>>> In which way does it interact with the per-scheduler cpu lock?
>>>>>>>
>>>>>>>> I won't Nack the patch, but I don't think I would ack it without
>>>>>>>> clear
>>>>>>>> evidence that the extra lock has a performance improvement that's
>>>>>>>> worth
>>>>>>>> the cost of the extra complexity.
>>>>>>>
>>>>>>> I think complexity is lower this way. Especially considering the 
>>>>>>> per-
>>>>>>> scheduler lock changing with moving a cpu to or from a cpupool.
>>>>>>
>>>>>> The key aspect of the per-scheduler lock is that once you hold it, 
>>>>>> the
>>>>>> pointer to the lock can't change.
>>>>>>
>>>>>> After this patch, the fact remains that sometimes you need to grab 
>>>>>> one
>>>>>> lock, sometimes the other, and sometimes both.
>>>>>>
>>>>>> And, tick_suspend() lives in the per-scheduler code.  Each scheduler
>>>>>> has
>>>>>> to remember that tick_suspend and tick_resume hold a completely
>>>>>> different lock to the rest of the scheduling functions.
>>>>>
>>>>> Is that really so critical? Today only credit1 has tick_suspend and
>>>>> tick_resume hooks, and both are really very simple. I can add a
>>>>> comment in sched-if.h if you like.
>>>>>
>>>>> And up to now there was no lock at all involved when calling them...
>>>>>
>>>>> If you think using the normal scheduler lock is to be preferred I'd
>>>>> be happy to change the patch. But I should mention I was already
>>>>> planning to revisit usage of the scheduler lock and replace it by the
>>>>> new per-cpu lock where appropriate (not sure I'd find any appropriate
>>>>> path for replacement).
>>>>
>>>> Well the really annoying thing here is that all the other schedulers --
>>>> in particular, credit2, which as you say, is designed to have multiple
>>>> runqueues share the same lock -- have to grab & release the lock 
>>>> just to
>>>> find out that there's nothing to do.
>>>>
>>>> And even credit1 doesn't do anything particularly clever -- all it does
>>>> is stop and start a timer based on a scheduler-global configuration. 
>>>> And
>>>> the scheduling lock is grabbed to switch to idle anyway.  It seems like
>>>> we should be able to do something more sensible.
>>>
>>> Yeah, I thought the same.
>>
>> I can think of a couple of options:
>>
>> 1. Have schedule.c call s->tick_* when switching to / from idle
>>
>> 2. Get rid of s->tick_*, and have sched_credit.c suspend / resume ticks
>> when switching to / from idle in csched_schedule()
>>
>> 3. Have schedule.c suspend / resume ticks, and have an interface that
>> allows schedulers to enable / disable them.
>>
>> 4. Rework sched_credit to be tickless.
> 
> I'm going with 2., as it will have multiple advantages:
> 
> - not very intrusive
> - keeps credit specifics in credit
> - allows us to fix an existing bug in credit: if we have only one domain
>    running in a cpupool with credit scheduler and this domain is capped
>    today's handling will disable the credit tick in idle and there will
>    be nobody unpausing the vcpus of the capped domain.

This is not an issue after all; I had mixed up the timer and the
master_timer of the credit scheduler.


Juergen
Dario Faggioli Oct. 7, 2019, 8:49 a.m. UTC | #15
On Sun, 2019-10-06 at 20:05 +0200, Jürgen Groß wrote:
> On 04.10.19 18:09, George Dunlap wrote:
> > 
> > I can think of a couple of options:
> > 
> > 1. Have schedule.c call s->tick_* when switching to / from idle
> > 
> > 2. Get rid of s->tick_*, and have sched_credit.c suspend / resume
> > ticks
> > when switching to / from idle in csched_schedule()
> > 
> > 3. Have schedule.c suspend / resume ticks, and have an interface
> > that
> > allows schedulers to enable / disable them.
> > 
> > 4. Rework sched_credit to be tickless.
> 
> I'm going with 2., as it will have multiple advantages:
> 
Good choice! For these reasons:

> - not very intrusive
> - keeps credit specifics in credit
>
And also because, if you'd go for 4, I'm sure that reviewing something
like that would cause me nightmares! :-O

Thanks and Regards
Patch

diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 217fcb09ce..744f8cb5db 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -68,6 +68,9 @@  cpumask_t sched_res_mask;
 /* Common lock for free cpus. */
 static DEFINE_SPINLOCK(sched_free_cpu_lock);
 
+/* Lock for guarding per-scheduler calls against scheduler changes on a cpu. */
+static DEFINE_PER_CPU(spinlock_t, sched_cpu_lock);
+
 /* Various timer handlers. */
 static void s_timer_fn(void *unused);
 static void vcpu_periodic_timer_fn(void *data);
@@ -2472,6 +2475,8 @@  static int cpu_schedule_up(unsigned int cpu)
     if ( sr == NULL )
         return -ENOMEM;
 
+    spin_lock_init(&per_cpu(sched_cpu_lock, cpu));
+
     sr->master_cpu = cpu;
     cpumask_copy(sr->cpus, cpumask_of(cpu));
     set_sched_res(cpu, sr);
@@ -2763,11 +2768,14 @@  int schedule_cpu_add(unsigned int cpu, struct cpupool *c)
     struct scheduler *new_ops = c->sched;
     struct sched_resource *sr;
     spinlock_t *old_lock, *new_lock;
+    spinlock_t *cpu_lock = &per_cpu(sched_cpu_lock, cpu);
     unsigned long flags;
     int ret = 0;
 
     rcu_read_lock(&sched_res_rculock);
 
+    spin_lock(cpu_lock);
+
     sr = get_sched_res(cpu);
 
     ASSERT(cpumask_test_cpu(cpu, &cpupool_free_cpus));
@@ -2879,6 +2887,8 @@  int schedule_cpu_add(unsigned int cpu, struct cpupool *c)
     cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);
 
 out:
+    spin_unlock(cpu_lock);
+
     rcu_read_unlock(&sched_res_rculock);
 
     return ret;
@@ -2897,12 +2907,15 @@  int schedule_cpu_rm(unsigned int cpu)
     struct sched_unit *unit;
     struct scheduler *old_ops;
     spinlock_t *old_lock;
+    spinlock_t *cpu_lock = &per_cpu(sched_cpu_lock, cpu);
     unsigned long flags;
     int idx, ret = -ENOMEM;
     unsigned int cpu_iter;
 
     rcu_read_lock(&sched_res_rculock);
 
+    spin_lock(cpu_lock);
+
     sr = get_sched_res(cpu);
     old_ops = sr->scheduler;
 
@@ -3004,6 +3017,8 @@  int schedule_cpu_rm(unsigned int cpu)
     sr->cpupool = NULL;
 
 out:
+    spin_unlock(cpu_lock);
+
     rcu_read_unlock(&sched_res_rculock);
     xfree(sr_new);
 
@@ -3084,11 +3099,17 @@  void sched_tick_suspend(void)
 {
     struct scheduler *sched;
     unsigned int cpu = smp_processor_id();
+    spinlock_t *lock = &per_cpu(sched_cpu_lock, cpu);
 
     rcu_read_lock(&sched_res_rculock);
 
+    spin_lock(lock);
+
     sched = get_sched_res(cpu)->scheduler;
     sched_do_tick_suspend(sched, cpu);
+
+    spin_unlock(lock);
+
     rcu_idle_enter(cpu);
     rcu_idle_timer_start();
 
@@ -3099,14 +3120,20 @@  void sched_tick_resume(void)
 {
     struct scheduler *sched;
     unsigned int cpu = smp_processor_id();
+    spinlock_t *lock = &per_cpu(sched_cpu_lock, cpu);
 
     rcu_read_lock(&sched_res_rculock);
 
     rcu_idle_timer_stop();
     rcu_idle_exit(cpu);
+
+    spin_lock(lock);
+
     sched = get_sched_res(cpu)->scheduler;
     sched_do_tick_resume(sched, cpu);
 
+    spin_unlock(lock);
+
     rcu_read_unlock(&sched_res_rculock);
 }