diff mbox series

[RFC,V2,42/45] xen/sched: add fall back to idle vcpu when scheduling item

Message ID 20190506065644.7415-43-jgross@suse.com (mailing list archive)
State Superseded
Series xen: add core scheduling support

Commit Message

Jürgen Groß May 6, 2019, 6:56 a.m. UTC
When scheduling an item with multiple vcpus there is no guarantee that
all vcpus are available (e.g. above maxvcpus, or a vcpu being offline).
Fall back to the idle vcpu of the current cpu in that case. This
requires storing the correct schedule_item pointer in the idle vcpu as
long as it is being used as a fallback vcpu.

In order to modify the runstates of the correct vcpus when switching
schedule items, merge sched_item_runstate_change() into
sched_switch_items() and loop over the affected physical cpus instead
of the item's vcpus. This in turn requires an access function for the
"current" variable of other cpus.

Today context_saved() is called when the previous and next vcpus differ
while doing a context switch. With an idle vcpu being able to
substitute for an offline vcpu this is problematic when switching to
an idle scheduling item: an idle previous vcpu leaves us in doubt which
schedule item was active before. So save the previous item pointer in
the per-schedule-resource area and use its being non-NULL as a hint
whether context_saved() should be called.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
RFC V2: new patch (Andrew Cooper)
---
 xen/arch/x86/domain.c         |  21 ++++++
 xen/common/schedule.c         | 153 +++++++++++++++++++++++++++---------------
 xen/include/asm-arm/current.h |   1 +
 xen/include/asm-x86/current.h |   7 +-
 xen/include/asm-x86/smp.h     |   3 +
 xen/include/xen/sched-if.h    |   4 +-
 6 files changed, 134 insertions(+), 55 deletions(-)

Comments

Jan Beulich May 16, 2019, 1:05 p.m. UTC | #1
>>> On 06.05.19 at 08:56, <jgross@suse.com> wrote:
> --- a/xen/arch/x86/domain.c
> +++ b/xen/arch/x86/domain.c
> @@ -154,6 +154,24 @@ static void idle_loop(void)
>      }
>  }
>  
> +/*
> + * Idle loop for siblings of active schedule items.
> + * We don't do any standard idle work like tasklets, page scrubbing or
> + * livepatching.
> + * Use default_idle() in order to simulate v->is_urgent.

I guess I'm missing a part of the description which explains all this:
What's wrong with doing scrubbing work, for example? Why is
doing tasklet work not okay, but softirqs are? What is the deal with
v->is_urgent, i.e. what justifies not entering a decent power
saving mode here on Intel, but doing so on AMD?

> --- a/xen/include/asm-x86/smp.h
> +++ b/xen/include/asm-x86/smp.h
> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
>  /* Representing HT and core siblings in each socket. */
>  extern cpumask_t **socket_cpumask;
>  
> +#define get_cpu_current(cpu) \
> +    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)

Yet another, slightly different notion of "current". If "current"
itself is not suitable (I can't immediately see why that would be,
but I also didn't look at all the scheduler specific changes earlier
in this series), why isn't per_cpu(curr_vcpu, cpu) either?

Jan
Jürgen Groß May 16, 2019, 1:51 p.m. UTC | #2
On 16/05/2019 15:05, Jan Beulich wrote:
>>>> On 06.05.19 at 08:56, <jgross@suse.com> wrote:
>> --- a/xen/arch/x86/domain.c
>> +++ b/xen/arch/x86/domain.c
>> @@ -154,6 +154,24 @@ static void idle_loop(void)
>>      }
>>  }
>>  
>> +/*
>> + * Idle loop for siblings of active schedule items.
>> + * We don't do any standard idle work like tasklets, page scrubbing or
>> + * livepatching.
>> + * Use default_idle() in order to simulate v->is_urgent.
> 
> I guess I'm missing a part of the description which explains all this:
> What's wrong with doing scrubbing work, for example? Why is
> doing tasklet work not okay, but softirqs are? What is the deal with
> v->is_urgent, i.e. what justifies not entering a decent power
> saving mode here on Intel, but doing so on AMD?

One of the reasons for using core scheduling is to avoid running vcpus
of different domains on the same core, in order to minimize the chances
of side channel attacks on data of other domains. Not allowing
scrubbing or tasklets here is meant to avoid accessing data of other
domains.

And with core scheduling we can be sure the other thread is active
(otherwise we would schedule the idle item), so hoping to save power
by using mwait is moot.

> 
>> --- a/xen/include/asm-x86/smp.h
>> +++ b/xen/include/asm-x86/smp.h
>> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
>>  /* Representing HT and core siblings in each socket. */
>>  extern cpumask_t **socket_cpumask;
>>  
>> +#define get_cpu_current(cpu) \
>> +    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
> 
> Yet another, slightly different notion of "current". If "current"
> itself is not suitable (I can't immediately see why that would be,
> but I also didn't look at all the scheduler specific changes earlier
> in this series), why isn't per_cpu(curr_vcpu, cpu) either?

current is always the vcpu running on the current physical cpu.
curr_vcpu is the vcpu which last ran in guest mode (this avoids the
need to save/restore context in case a vcpu is blocked for a short
time without another guest vcpu running on the physical cpu in
between), so with current being idle the two can differ.

Here I need "current" of another physical cpu, which is not easily
available.


Juergen
Jan Beulich May 16, 2019, 2:41 p.m. UTC | #3
>>> On 16.05.19 at 15:51, <jgross@suse.com> wrote:
> On 16/05/2019 15:05, Jan Beulich wrote:
>>>>> On 06.05.19 at 08:56, <jgross@suse.com> wrote:
>>> --- a/xen/arch/x86/domain.c
>>> +++ b/xen/arch/x86/domain.c
>>> @@ -154,6 +154,24 @@ static void idle_loop(void)
>>>      }
>>>  }
>>>  
>>> +/*
>>> + * Idle loop for siblings of active schedule items.
>>> + * We don't do any standard idle work like tasklets, page scrubbing or
>>> + * livepatching.
>>> + * Use default_idle() in order to simulate v->is_urgent.
>> 
>> I guess I'm missing a part of the description which explains all this:
>> What's wrong with doing scrubbing work, for example? Why is
>> doing tasklet work not okay, but softirqs are? What is the deal with
>> v->is_urgent, i.e. what justifies not entering a decent power
>> saving mode here on Intel, but doing so on AMD?
> 
> One of the reasons for using core scheduling is to avoid running vcpus
> of different domains on the same core in order to minimize the chances
> for side channel attacks to data of other domains. Not allowing
> scrubbing or tasklets here is due to avoid accessing data of other
> domains.

So how is running softirqs okay then? And how is scrubbing accessing
other domains' data?

> As with core scheduling we can be sure the other thread is active
> (otherwise we would schedule the idle item) and hoping for saving power
> by using mwait is moot.

Saving power may be indirect, by the CPU re-arranging
resource assignment between threads when one goes idle.
I have no idea whether they do this when entering C1, or
only when entering deeper C states.

And anyway - I'm still none the wiser as to the v->is_urgent
relationship.

>>> --- a/xen/include/asm-x86/smp.h
>>> +++ b/xen/include/asm-x86/smp.h
>>> @@ -76,6 +76,9 @@ void set_nr_sockets(void);
>>>  /* Representing HT and core siblings in each socket. */
>>>  extern cpumask_t **socket_cpumask;
>>>  
>>> +#define get_cpu_current(cpu) \
>>> +    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
>> 
>> Yet another, slightly different notion of "current". If "current"
>> itself is not suitable (I can't immediately see why that would be,
>> but I also didn't look at all the scheduler specific changes earlier
>> in this series), why isn't per_cpu(curr_vcpu, cpu) either?
> 
> current is always the vcpu running on the current physical cpu.
> curr_vcpu is the vcpu which was the one running in guest mode last
> (this avoids the need to save/restore context in case a vcpu is
> blocked for a short time without another guest vcpu running on the
> physical cpu in between), so with current being idle the two can
> differ.
> 
> Here I need "current" from another physical cpu which is not easily
> available.

Oh, right - I should have been able to spot this.

Jan
Jürgen Groß May 17, 2019, 5:13 a.m. UTC | #4
On 16/05/2019 16:41, Jan Beulich wrote:
>>>> On 16.05.19 at 15:51, <jgross@suse.com> wrote:
>> On 16/05/2019 15:05, Jan Beulich wrote:
>>>>>> On 06.05.19 at 08:56, <jgross@suse.com> wrote:
>>>> --- a/xen/arch/x86/domain.c
>>>> +++ b/xen/arch/x86/domain.c
>>>> @@ -154,6 +154,24 @@ static void idle_loop(void)
>>>>      }
>>>>  }
>>>>  
>>>> +/*
>>>> + * Idle loop for siblings of active schedule items.
>>>> + * We don't do any standard idle work like tasklets, page scrubbing or
>>>> + * livepatching.
>>>> + * Use default_idle() in order to simulate v->is_urgent.
>>>
>>> I guess I'm missing a part of the description which explains all this:
>>> What's wrong with doing scrubbing work, for example? Why is
>>> doing tasklet work not okay, but softirqs are? What is the deal with
>>> v->is_urgent, i.e. what justifies not entering a decent power
>>> saving mode here on Intel, but doing so on AMD?
>>
>> One of the reasons for using core scheduling is to avoid running vcpus
>> of different domains on the same core in order to minimize the chances
>> for side channel attacks to data of other domains. Not allowing
>> scrubbing or tasklets here is due to avoid accessing data of other
>> domains.
> 
> So how is running softirqs okay then? And how is scrubbing accessing
> other domains' data?

Right now I'm not sure whether it is a good idea to block any softirqs.
We definitely need to process scheduling requests, and I believe RCU and
tasklets, too. The tlbflush one should be uncritical, so timer is the
remaining one which might be questionable. This can be fine-tuned later
IMO, e.g. by defining a softirq mask of critical softirqs to block and
possibly splitting up e.g. the timer and tasklet softirqs into critical
and uncritical ones.

Scrubbing will probably pull the cache lines of the dirty pages into
the L1 cache of the cpu. To me this sounds problematic. In case we are
fine with doing scrubbing because there is no associated risk, I'm fine
with adding it back in.

>> As with core scheduling we can be sure the other thread is active
>> (otherwise we would schedule the idle item) and hoping for saving power
>> by using mwait is moot.
> 
> Saving power may be indirect, by the CPU re-arranging
> resource assignment between threads when one goes idle.
> I have no idea whether they do this when entering C1, or
> only when entering deeper C states.

SDM Vol. 3 chapter 8.10.1 "HLT instruction":

"Here shared resources that were being used by the halted logical
processor become available to active logical processors, allowing them
to execute at greater efficiency."

> And anyway - I'm still none the wiser as to the v->is_urgent
> relationship.

With v->is_urgent set today's idle loop will drop into default_idle().
I can remove this sentence in case it is just confusing.


Juergen
Jan Beulich May 17, 2019, 6:57 a.m. UTC | #5
>>> On 17.05.19 at 07:13, <jgross@suse.com> wrote:
> On 16/05/2019 16:41, Jan Beulich wrote:
>>>>> On 16.05.19 at 15:51, <jgross@suse.com> wrote:
>>> On 16/05/2019 15:05, Jan Beulich wrote:
>>>>>>> On 06.05.19 at 08:56, <jgross@suse.com> wrote:
>>>>> --- a/xen/arch/x86/domain.c
>>>>> +++ b/xen/arch/x86/domain.c
>>>>> @@ -154,6 +154,24 @@ static void idle_loop(void)
>>>>>      }
>>>>>  }
>>>>>  
>>>>> +/*
>>>>> + * Idle loop for siblings of active schedule items.
>>>>> + * We don't do any standard idle work like tasklets, page scrubbing or
>>>>> + * livepatching.
>>>>> + * Use default_idle() in order to simulate v->is_urgent.
>>>>
>>>> I guess I'm missing a part of the description which explains all this:
>>>> What's wrong with doing scrubbing work, for example? Why is
>>>> doing tasklet work not okay, but softirqs are? What is the deal with
>>>> v->is_urgent, i.e. what justifies not entering a decent power
>>>> saving mode here on Intel, but doing so on AMD?
>>>
>>> One of the reasons for using core scheduling is to avoid running vcpus
>>> of different domains on the same core in order to minimize the chances
>>> for side channel attacks to data of other domains. Not allowing
>>> scrubbing or tasklets here is due to avoid accessing data of other
>>> domains.
>> 
>> So how is running softirqs okay then? And how is scrubbing accessing
>> other domains' data?
> 
> Right now I'm not sure whether it is a good idea to block any softirqs.
> We definitely need to process scheduling requests and I believe RCU and
> tasklets, too. The tlbflush one should be uncritical, so timers is the
> remaining one which might be questionable. This can be fine-tuned later
> IMO e.g. by defining a softirq mask of critical softirqs to block and
> eventually splitting up e.g. timer and tasklet softirqs into critical
> and uncritical ones.

Well, okay, but please add an abridged version of this to the patch
description then.

> Scrubbing will probably pull the cache lines of the dirty pages into
> the L1 cache of the cpu. For me this sounds problematic. In case we
> are fine to do scrubbing as there is no risk associated I'm fine to add
> it back in.

Well, of course there's going to be a brief period of time where
a cache line will be present in CPU internal buffers (it's not just the
cache after all, as we've learned with XSA-297). So I can certainly
buy that when using core granularity you don't want to scrub on
the other thread. But what about the socket granularity case?
Scrubbing on fully idle cores should still be fine, I would think.

>>> As with core scheduling we can be sure the other thread is active
>>> (otherwise we would schedule the idle item) and hoping for saving power
>>> by using mwait is moot.
>> 
>> Saving power may be indirect, by the CPU re-arranging
>> resource assignment between threads when one goes idle.
>> I have no idea whether they do this when entering C1, or
>> only when entering deeper C states.
> 
> SDM Vol. 3 chapter 8.10.1 "HLT instruction":
> 
> "Here shared resources that were being used by the halted logical
> processor become available to active logical processors, allowing them
> to execute at greater efficiency."

To be honest, this is too broad/generic a statement to fully
trust it, judging from other areas of the SDM. And then, as
per above, what about the socket granularity case? Putting
entirely idle cores to sleep is surely worthwhile?

>> And anyway - I'm still none the wiser as to the v->is_urgent
>> relationship.
> 
> With v->is_urgent set today's idle loop will drop into default_idle().
> I can remove this sentence in case it is just confusing.

I'd prefer if the connection would become more obvious. One
needs to go from ->is_urgent via ->urgent_count to
sched_has_urgent_vcpu() to find where the described
behavior really lives.

What's worse though: This won't work as intended on AMD
at all. I don't think it's correct to fall back to default_idle() in
this case. Instead sched_has_urgent_vcpu() returning true
should amount to the same effect as max_cstate being set
to 1. There's
(a) no reason not to use MWAIT on Intel CPUs in this case,
if MWAIT can enter C1, and
(b) a strong need to use MWAIT on (at least) AMD Fam17,
or else it won't be C1 that gets entered.
I'll see about making a patch in due course.

Jan
Jürgen Groß May 17, 2019, 7:48 a.m. UTC | #6
On 17/05/2019 08:57, Jan Beulich wrote:
>>>> On 17.05.19 at 07:13, <jgross@suse.com> wrote:
>> On 16/05/2019 16:41, Jan Beulich wrote:
>>>>>> On 16.05.19 at 15:51, <jgross@suse.com> wrote:
>>>> On 16/05/2019 15:05, Jan Beulich wrote:
>>>>>>>> On 06.05.19 at 08:56, <jgross@suse.com> wrote:
>>>>>> --- a/xen/arch/x86/domain.c
>>>>>> +++ b/xen/arch/x86/domain.c
>>>>>> @@ -154,6 +154,24 @@ static void idle_loop(void)
>>>>>>      }
>>>>>>  }
>>>>>>  
>>>>>> +/*
>>>>>> + * Idle loop for siblings of active schedule items.
>>>>>> + * We don't do any standard idle work like tasklets, page scrubbing or
>>>>>> + * livepatching.
>>>>>> + * Use default_idle() in order to simulate v->is_urgent.
>>>>>
>>>>> I guess I'm missing a part of the description which explains all this:
>>>>> What's wrong with doing scrubbing work, for example? Why is
>>>>> doing tasklet work not okay, but softirqs are? What is the deal with
>>>>> v->is_urgent, i.e. what justifies not entering a decent power
>>>>> saving mode here on Intel, but doing so on AMD?
>>>>
>>>> One of the reasons for using core scheduling is to avoid running vcpus
>>>> of different domains on the same core in order to minimize the chances
>>>> for side channel attacks to data of other domains. Not allowing
>>>> scrubbing or tasklets here is due to avoid accessing data of other
>>>> domains.
>>>
>>> So how is running softirqs okay then? And how is scrubbing accessing
>>> other domains' data?
>>
>> Right now I'm not sure whether it is a good idea to block any softirqs.
>> We definitely need to process scheduling requests and I believe RCU and
>> tasklets, too. The tlbflush one should be uncritical, so timers is the
>> remaining one which might be questionable. This can be fine-tuned later
>> IMO e.g. by defining a softirq mask of critical softirqs to block and
>> eventually splitting up e.g. timer and tasklet softirqs into critical
>> and uncritical ones.
> 
> Well, okay, but please add an abridged version of this to the patch
> description then.

Okay.

> 
>> Scrubbing will probably pull the cache lines of the dirty pages into
>> the L1 cache of the cpu. For me this sounds problematic. In case we
>> are fine to do scrubbing as there is no risk associated I'm fine to add
>> it back in.
> 
> Well, of course there's going to be a brief period of time where
> a cache line will be present in CPU internal buffers (it's not just the
> cache after all, as we've learned with XSA-297). So I can certainly
> buy that when using core granularity you don't want to scrub on
> the other thread. But what about the socket granularity case?
> Scrubbing on fully idle cores should still be fine, I would think.

I think this would depend on the reason for selecting socket scheduling.
I'd at least want to have a way to select that, as I can think of
e.g. L3-cache side channel attacks, too.

So maybe I could add a patch on top for adding a sub-option to the
sched-gran parameter which will allow (or disallow?) scrubbing on idle
cores or threads.

> 
>>>> As with core scheduling we can be sure the other thread is active
>>>> (otherwise we would schedule the idle item) and hoping for saving power
>>>> by using mwait is moot.
>>>
>>> Saving power may be indirect, by the CPU re-arranging
>>> resource assignment between threads when one goes idle.
>>> I have no idea whether they do this when entering C1, or
>>> only when entering deeper C states.
>>
>> SDM Vol. 3 chapter 8.10.1 "HLT instruction":
>>
>> "Here shared resources that were being used by the halted logical
>> processor become available to active logical processors, allowing them
>> to execute at greater efficiency."
> 
> To be honest, this is too broad/generic a statement to fully
> trust it, judging from other areas of the SDM. And then, as
> per above, what about the socket granularity case? Putting
> entirely idle cores to sleep is surely worthwhile?

Yes, I assume it is. OTOH this might affect context switches badly
as the reaction time for the coordinated switch will rise. Maybe a
good reason for another sub-option?

>>> And anyway - I'm still none the wiser as to the v->is_urgent
>>> relationship.
>>
>> With v->is_urgent set today's idle loop will drop into default_idle().
>> I can remove this sentence in case it is just confusing.
> 
> I'd prefer if the connection would become more obvious. One
> needs to go from ->is_urgent via ->urgent_count to
> sched_has_urgent_vcpu() to find where the described
> behavior really lives.
> 
> What's worse though: This won't work as intended on AMD
> at all. I don't think it's correct to fall back to default_idle() in
> this case. Instead sched_has_urgent_vcpu() returning true
> should amount to the same effect as max_cstate being set
> to 1. There's
> (a) no reason not to use MWAIT on Intel CPUs in this case,
> if MWAIT can enter C1, and
> (b) a strong need to use MWAIT on (at least) AMD Fam17,
> or else it won't be C1 that gets entered.
> I'll see about making a patch in due course.

Thanks. Would you mind doing it in a way that the caller can specify
max_cstate? This would remove the need to call sched_has_urgent_vcpu()
deep down in the idle handling, and I could re-use it for my purpose.


Juergen
Jan Beulich May 17, 2019, 8:22 a.m. UTC | #7
>>> On 17.05.19 at 09:48, <jgross@suse.com> wrote:
> On 17/05/2019 08:57, Jan Beulich wrote:
>>>>> On 17.05.19 at 07:13, <jgross@suse.com> wrote:
>>> On 16/05/2019 16:41, Jan Beulich wrote:
>>>>>>> On 16.05.19 at 15:51, <jgross@suse.com> wrote:
>>>>> As with core scheduling we can be sure the other thread is active
>>>>> (otherwise we would schedule the idle item) and hoping for saving power
>>>>> by using mwait is moot.
>>>>
>>>> Saving power may be indirect, by the CPU re-arranging
>>>> resource assignment between threads when one goes idle.
>>>> I have no idea whether they do this when entering C1, or
>>>> only when entering deeper C states.
>>>
>>> SDM Vol. 3 chapter 8.10.1 "HLT instruction":
>>>
>>> "Here shared resources that were being used by the halted logical
>>> processor become available to active logical processors, allowing them
>>> to execute at greater efficiency."
>> 
>> To be honest, this is too broad/generic a statement to fully
>> trust it, judging from other areas of the SDM. And then, as
>> per above, what about the socket granularity case? Putting
>> entirely idle cores to sleep is surely worthwhile?
> 
> Yes, I assume it is. OTOH this might affect context switches badly
> as the reaction time for the coordinated switch will rise. Maybe a
> good reason for another sub-option?

While I agree that fine-grained control is useful, I'm seeing an
increasing risk of there being too many controls to actually
be certain in the end that all possible combinations work
correctly.

>>>> And anyway - I'm still none the wiser as to the v->is_urgent
>>>> relationship.
>>>
>>> With v->is_urgent set today's idle loop will drop into default_idle().
>>> I can remove this sentence in case it is just confusing.
>> 
>> I'd prefer if the connection would become more obvious. One
>> needs to go from ->is_urgent via ->urgent_count to
>> sched_has_urgent_vcpu() to find where the described
>> behavior really lives.
>> 
>> What's worse though: This won't work as intended on AMD
>> at all. I don't think it's correct to fall back to default_idle() in
>> this case. Instead sched_has_urgent_vcpu() returning true
>> should amount to the same effect as max_cstate being set
>> to 1. There's
>> (a) no reason not to use MWAIT on Intel CPUs in this case,
>> if MWAIT can enter C1, and
>> (b) a strong need to use MWAIT on (at least) AMD Fam17,
>> or else it won't be C1 that gets entered.
>> I'll see about making a patch in due course.
> 
> Thanks. Would you mind doing it in a way that the caller can specify
> max_cstate? This would remove the need to call sched_has_urgent_vcpu()
> deep down the idle handling and I could re-use it for my purpose.

Hmm, to be honest I don't fancy giving a parameter to
default_idle(), pm_idle(), and friends. Conceptually it is not
the business of the callers to control the C states to be used.

What about the opposite: You simply mark idle (v)CPUs in
question as "urgent", thus achieving the intended effect as
well.

Jan
Jürgen Groß May 17, 2019, 8:57 a.m. UTC | #8
On 17/05/2019 10:22, Jan Beulich wrote:
>>>> On 17.05.19 at 09:48, <jgross@suse.com> wrote:
>> On 17/05/2019 08:57, Jan Beulich wrote:
>>>>>> On 17.05.19 at 07:13, <jgross@suse.com> wrote:
>>>> On 16/05/2019 16:41, Jan Beulich wrote:
>>>>>>>> On 16.05.19 at 15:51, <jgross@suse.com> wrote:
>>>>>> As with core scheduling we can be sure the other thread is active
>>>>>> (otherwise we would schedule the idle item) and hoping for saving power
>>>>>> by using mwait is moot.
>>>>>
>>>>> Saving power may be indirect, by the CPU re-arranging
>>>>> resource assignment between threads when one goes idle.
>>>>> I have no idea whether they do this when entering C1, or
>>>>> only when entering deeper C states.
>>>>
>>>> SDM Vol. 3 chapter 8.10.1 "HLT instruction":
>>>>
>>>> "Here shared resources that were being used by the halted logical
>>>> processor become available to active logical processors, allowing them
>>>> to execute at greater efficiency."
>>>
>>> To be honest, this is too broad/generic a statement to fully
>>> trust it, judging from other areas of the SDM. And then, as
>>> per above, what about the socket granularity case? Putting
>>> entirely idle cores to sleep is surely worthwhile?
>>
>> Yes, I assume it is. OTOH this might affect context switches badly
>> as the reaction time for the coordinated switch will rise. Maybe a
>> good reason for another sub-option?
> 
> While I agree that fine grained control is useful, I'm seeing an
> increasing risk of there going to be too many controls to actually
> be certain in the end that all possible combinations work
> correctly.

Okay, I think I'll leave it as is for the moment and do some
performance tests later. Depending on the results we can then
decide how to proceed.

> 
>>>>> And anyway - I'm still none the wiser as to the v->is_urgent
>>>>> relationship.
>>>>
>>>> With v->is_urgent set today's idle loop will drop into default_idle().
>>>> I can remove this sentence in case it is just confusing.
>>>
>>> I'd prefer if the connection would become more obvious. One
>>> needs to go from ->is_urgent via ->urgent_count to
>>> sched_has_urgent_vcpu() to find where the described
>>> behavior really lives.
>>>
>>> What's worse though: This won't work as intended on AMD
>>> at all. I don't think it's correct to fall back to default_idle() in
>>> this case. Instead sched_has_urgent_vcpu() returning true
>>> should amount to the same effect as max_cstate being set
>>> to 1. There's
>>> (a) no reason not to use MWAIT on Intel CPUs in this case,
>>> if MWAIT can enter C1, and
>>> (b) a strong need to use MWAIT on (at least) AMD Fam17,
>>> or else it won't be C1 that gets entered.
>>> I'll see about making a patch in due course.
>>
>> Thanks. Would you mind doing it in a way that the caller can specify
>> max_cstate? This would remove the need to call sched_has_urgent_vcpu()
>> deep down the idle handling and I could re-use it for my purpose.
> 
> Hmm, to be honest I'm not fancying giving a parameter to
> default_idle(), pm_idle(), and friends. Conceptually it is not
> the business of the callers to control the C states to be used.
> 
> What about the opposite: You simply mark idle (v)CPUs in
> question as "urgent", thus achieving the intended effect as
> well.

I can do that, of course. IMO letting the caller specify how idle
should behave is cleaner than burying such an implicit dependency deep
down in the code, but you are the maintainer of that part.


Juergen

Patch

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index d04e704116..f3dbca5dba 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -154,6 +154,24 @@  static void idle_loop(void)
     }
 }
 
+/*
+ * Idle loop for siblings of active schedule items.
+ * We don't do any standard idle work like tasklets, page scrubbing or
+ * livepatching.
+ * Use default_idle() in order to simulate v->is_urgent.
+ */
+static void guest_idle_loop(void)
+{
+    unsigned int cpu = smp_processor_id();
+
+    for ( ; ; )
+    {
+        if ( !softirq_pending(cpu) )
+            default_idle();
+        do_softirq();
+    }
+}
+
 void startup_cpu_idle_loop(void)
 {
     struct vcpu *v = current;
@@ -167,6 +185,9 @@  void startup_cpu_idle_loop(void)
 
 static void noreturn continue_idle_domain(struct vcpu *v)
 {
+    if ( !is_idle_item(v->sched_item) )
+        reset_stack_and_jump(guest_idle_loop);
+
     reset_stack_and_jump(idle_loop);
 }
 
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 0de199ccc9..788ecc9e81 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -82,7 +82,18 @@  static struct scheduler __read_mostly ops;
 static inline struct vcpu *sched_item2vcpu_cpu(struct sched_item *item,
                                                unsigned int cpu)
 {
-    return item->domain->vcpu[item->item_id + per_cpu(sched_res_idx, cpu)];
+    unsigned int idx = item->item_id + per_cpu(sched_res_idx, cpu);
+    const struct domain *d = item->domain;
+    struct vcpu *v;
+
+    if ( idx < d->max_vcpus && d->vcpu[idx] )
+    {
+        v = d->vcpu[idx];
+        if ( v->new_state == RUNSTATE_running )
+            return v;
+    }
+
+    return idle_vcpu[cpu];
 }
 
 static inline struct scheduler *dom_scheduler(const struct domain *d)
@@ -196,8 +207,11 @@  static inline void vcpu_runstate_change(
 
     trace_runstate_change(v, new_state);
 
-    item->runstate_cnt[v->runstate.state]--;
-    item->runstate_cnt[new_state]++;
+    if ( !is_idle_vcpu(v) )
+    {
+        item->runstate_cnt[v->runstate.state]--;
+        item->runstate_cnt[new_state]++;
+    }
 
     delta = new_entry_time - v->runstate.state_entry_time;
     if ( delta > 0 )
@@ -209,21 +223,6 @@  static inline void vcpu_runstate_change(
     v->runstate.state = new_state;
 }
 
-static inline void sched_item_runstate_change(struct sched_item *item,
-    bool running, s_time_t new_entry_time)
-{
-    struct vcpu *v;
-
-    for_each_sched_item_vcpu( item, v )
-        if ( running )
-            vcpu_runstate_change(v, v->new_state, new_entry_time);
-        else
-            vcpu_runstate_change(v,
-                ((v->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
-                 (vcpu_runnable(v) ? RUNSTATE_runnable : RUNSTATE_offline)),
-                new_entry_time);
-}
-
 void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate)
 {
     spinlock_t *lock = likely(v == current)
@@ -456,6 +455,7 @@  int sched_init_vcpu(struct vcpu *v)
     if ( is_idle_domain(d) )
     {
         per_cpu(sched_res, v->processor)->curr = item;
+        per_cpu(sched_res, v->processor)->sched_item_idle = item;
         v->is_running = 1;
         item->is_running = 1;
         item->state_entry_time = NOW();
@@ -1631,33 +1631,67 @@  static void sched_switch_items(struct sched_resource *sd,
                                struct sched_item *next, struct sched_item *prev,
                                s_time_t now)
 {
-    sd->curr = next;
-
-    TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id, prev->item_id,
-             now - prev->state_entry_time);
-    TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id, next->item_id,
-             (next->vcpu->runstate.state == RUNSTATE_runnable) ?
-             (now - next->state_entry_time) : 0, prev->next_time);
+    int cpu;
 
     ASSERT(item_running(prev));
 
-    TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->item_id,
-             next->domain->domain_id, next->item_id);
+    if ( prev != next )
+    {
+        sd->curr = next;
+        sd->prev = prev;
 
-    sched_item_runstate_change(prev, false, now);
-    prev->last_run_time = now;
+        TRACE_3D(TRC_SCHED_SWITCH_INFPREV, prev->domain->domain_id,
+                 prev->item_id, now - prev->state_entry_time);
+        TRACE_4D(TRC_SCHED_SWITCH_INFNEXT, next->domain->domain_id,
+                 next->item_id,
+                 (next->vcpu->runstate.state == RUNSTATE_runnable) ?
+                 (now - next->state_entry_time) : 0, prev->next_time);
+        TRACE_4D(TRC_SCHED_SWITCH, prev->domain->domain_id, prev->item_id,
+                 next->domain->domain_id, next->item_id);
 
-    ASSERT(!item_running(next));
-    sched_item_runstate_change(next, true, now);
+        prev->last_run_time = now;
 
-    /*
-     * NB. Don't add any trace records from here until the actual context
-     * switch, else lost_records resume will not work properly.
-     */
+        ASSERT(!item_running(next));
+
+        /*
+         * NB. Don't add any trace records from here until the actual context
+         * switch, else lost_records resume will not work properly.
+         */
+
+        ASSERT(!next->is_running);
+        next->is_running = 1;
 
-    ASSERT(!next->is_running);
-    next->vcpu->is_running = 1;
-    next->is_running = 1;
+        if ( is_idle_item(prev) )
+        {
+            prev->runstate_cnt[RUNSTATE_running] = 0;
+            prev->runstate_cnt[RUNSTATE_runnable] = sched_granularity;
+        }
+        if ( is_idle_item(next) )
+        {
+            next->runstate_cnt[RUNSTATE_running] = sched_granularity;
+            next->runstate_cnt[RUNSTATE_runnable] = 0;
+        }
+    }
+
+    for_each_cpu( cpu, sd->cpus )
+    {
+        struct vcpu *vprev = get_cpu_current(cpu);
+        struct vcpu *vnext = sched_item2vcpu_cpu(next, cpu);
+
+        if ( vprev != vnext || vprev->runstate.state != vnext->new_state )
+        {
+            vcpu_runstate_change(vprev,
+                ((vprev->pause_flags & VPF_blocked) ? RUNSTATE_blocked :
+                 (vcpu_runnable(vprev) ? RUNSTATE_runnable : RUNSTATE_offline)),
+                now);
+            vcpu_runstate_change(vnext, vnext->new_state, now);
+        }
+
+        vnext->is_running = 1;
+
+        if ( is_idle_vcpu(vnext) )
+            vnext->sched_item = next;
+    }
 }
 
 static bool sched_tasklet_check(void)
@@ -1706,25 +1740,25 @@  static struct sched_item *do_schedule(struct sched_item *prev, s_time_t now)
     if ( prev->next_time >= 0 ) /* -ve means no limit */
         set_timer(&sd->s_timer, now + prev->next_time);
 
-    if ( likely(prev != next) )
-        sched_switch_items(sd, next, prev, now);
+    sched_switch_items(sd, next, prev, now);
 
     return next;
 }
 
-static void context_saved(struct vcpu *prev)
+static void context_saved(struct sched_item *item)
 {
-    struct sched_item *item = prev->sched_item;
-
     item->is_running = 0;
     item->state_entry_time = NOW();
+    this_cpu(sched_res)->prev = NULL;
 
     /* Check for migration request /after/ clearing running flag. */
     smp_mb();
 
-    sched_context_saved(vcpu_scheduler(prev), item);
+    sched_context_saved(vcpu_scheduler(item->vcpu), item);
 
-    sched_item_migrate_finish(item);
+    /* Idle never migrates and idle vcpus might belong to other items. */
+    if ( !is_idle_item(item) )
+        sched_item_migrate_finish(item);
 }
 
 /*
@@ -1741,11 +1775,13 @@  static void context_saved(struct vcpu *prev)
 void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
 {
     struct sched_item *next = vnext->sched_item;
+    struct sched_resource *sd = this_cpu(sched_res);
 
     /* Clear running flag /after/ writing context to memory. */
     smp_wmb();
 
-    vprev->is_running = 0;
+    if ( vprev != vnext )
+        vprev->is_running = 0;
 
     if ( atomic_read(&next->rendezvous_out_cnt) )
     {
@@ -1754,20 +1790,23 @@  void sched_context_switched(struct vcpu *vprev, struct vcpu *vnext)
         /* Call context_saved() before releasing other waiters. */
         if ( cnt == 1 )
         {
-            if ( vprev != vnext )
-                context_saved(vprev);
+            if ( sd->prev )
+                context_saved(sd->prev);
             atomic_set(&next->rendezvous_out_cnt, 0);
         }
         else
             while ( atomic_read(&next->rendezvous_out_cnt) )
                 cpu_relax();
     }
-    else if ( vprev != vnext && sched_granularity == 1 )
-        context_saved(vprev);
+    else if ( sd->prev )
+        context_saved(sd->prev);
+
+    if ( is_idle_vcpu(vprev) && vprev != vnext )
+        vprev->sched_item = sd->sched_item_idle;
 }
 
 static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
-                                 s_time_t now)
+                                 bool reset_idle_item, s_time_t now)
 {
     if ( unlikely(vprev == vnext) )
     {
@@ -1776,6 +1815,10 @@  static void sched_context_switch(struct vcpu *vprev, struct vcpu *vnext,
                  now - vprev->runstate.state_entry_time,
                  vprev->sched_item->next_time);
         sched_context_switched(vprev, vnext);
+
+        if ( reset_idle_item )
+            vnext->sched_item = this_cpu(sched_res)->sched_item_idle;
+
         trace_continue_running(vnext);
         return continue_running(vprev);
     }
@@ -1851,7 +1894,8 @@  static void sched_slave(void)
 
     pcpu_schedule_unlock_irq(lock, cpu);
 
-    sched_context_switch(vprev, sched_item2vcpu_cpu(next, cpu), now);
+    sched_context_switch(vprev, sched_item2vcpu_cpu(next, cpu),
+                         is_idle_item(next) && !is_idle_item(prev), now);
 }
 
 /*
@@ -1911,7 +1955,8 @@  static void schedule(void)
     pcpu_schedule_unlock_irq(lock, cpu);
 
     vnext = sched_item2vcpu_cpu(next, cpu);
-    sched_context_switch(vprev, vnext, now);
+    sched_context_switch(vprev, vnext,
+                         !is_idle_item(prev) && is_idle_item(next), now);
 }
 
 /* The scheduler timer: force a run through the scheduler */
@@ -1993,6 +2038,7 @@  static int cpu_schedule_up(unsigned int cpu)
         return -ENOMEM;
 
     sd->curr = idle_vcpu[cpu]->sched_item;
+    sd->sched_item_idle = idle_vcpu[cpu]->sched_item;
 
     /*
      * We don't want to risk calling xfree() on an sd->sched_priv
@@ -2170,6 +2216,7 @@  void __init scheduler_init(void)
     if ( vcpu_create(idle_domain, 0) == NULL )
         BUG();
     this_cpu(sched_res)->curr = idle_vcpu[0]->sched_item;
+    this_cpu(sched_res)->sched_item_idle = idle_vcpu[0]->sched_item;
     this_cpu(sched_res)->sched_priv = sched_alloc_pdata(&ops, 0);
     BUG_ON(IS_ERR(this_cpu(sched_res)->sched_priv));
     scheduler_percpu_init(0);
diff --git a/xen/include/asm-arm/current.h b/xen/include/asm-arm/current.h
index c4af66fbb9..a7602eef8c 100644
--- a/xen/include/asm-arm/current.h
+++ b/xen/include/asm-arm/current.h
@@ -18,6 +18,7 @@  DECLARE_PER_CPU(struct vcpu *, curr_vcpu);
 
 #define current            (this_cpu(curr_vcpu))
 #define set_current(vcpu)  do { current = (vcpu); } while (0)
+#define get_cpu_current(cpu)  (per_cpu(curr_vcpu, cpu))
 
 /* Per-VCPU state that lives at the top of the stack */
 struct cpu_info {
diff --git a/xen/include/asm-x86/current.h b/xen/include/asm-x86/current.h
index 5bd64b2271..cb5b6f1176 100644
--- a/xen/include/asm-x86/current.h
+++ b/xen/include/asm-x86/current.h
@@ -76,6 +76,11 @@  struct cpu_info {
     /* get_stack_bottom() must be 16-byte aligned */
 };
 
+static inline struct cpu_info *get_cpu_info_from_stack(unsigned long sp)
+{
+    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
+}
+
 static inline struct cpu_info *get_cpu_info(void)
 {
 #ifdef __clang__
@@ -86,7 +91,7 @@  static inline struct cpu_info *get_cpu_info(void)
     register unsigned long sp asm("rsp");
 #endif
 
-    return (struct cpu_info *)((sp | (STACK_SIZE - 1)) + 1) - 1;
+    return get_cpu_info_from_stack(sp);
 }
 
 #define get_current()         (get_cpu_info()->current_vcpu)
diff --git a/xen/include/asm-x86/smp.h b/xen/include/asm-x86/smp.h
index 9f533f9072..51a31ab00a 100644
--- a/xen/include/asm-x86/smp.h
+++ b/xen/include/asm-x86/smp.h
@@ -76,6 +76,9 @@  void set_nr_sockets(void);
 /* Representing HT and core siblings in each socket. */
 extern cpumask_t **socket_cpumask;
 
+#define get_cpu_current(cpu) \
+    (get_cpu_info_from_stack((unsigned long)stack_base[cpu])->current_vcpu)
+
 #endif /* !__ASSEMBLY__ */
 
 #endif
diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h
index b3921f3a41..8981d41629 100644
--- a/xen/include/xen/sched-if.h
+++ b/xen/include/xen/sched-if.h
@@ -39,6 +39,8 @@  struct sched_resource {
     spinlock_t         *schedule_lock,
                        _lock;
     struct sched_item  *curr;           /* current task                    */
+    struct sched_item  *sched_item_idle;
+    struct sched_item  *prev;           /* previous task                   */
     void               *sched_priv;
     struct timer        s_timer;        /* scheduling timer                */
     atomic_t            urgent_count;   /* how many urgent vcpus           */
@@ -152,7 +154,7 @@  static inline void sched_clear_pause_flags_atomic(struct sched_item *item,
 
 static inline struct sched_item *sched_idle_item(unsigned int cpu)
 {
-    return idle_vcpu[cpu]->sched_item;
+    return per_cpu(sched_res, cpu)->sched_item_idle;
 }
 
 static inline unsigned int sched_get_resource_cpu(unsigned int cpu)