diff mbox

[V2,RFC,3/3] kvm: Check system load and handle different commit cases accordingly

Message ID 20121029140717.15448.83182.sendpatchset@codeblue (mailing list archive)
State New, archived
Headers show

Commit Message

Raghavendra K T Oct. 29, 2012, 2:07 p.m. UTC
From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

The patch introduces a helper function that calculates the system load
(the idea is borrowed from the loadavg calculation). The load is normalized
to 2048, i.e., a return value (threshold) of 2048 implies an approximately
1:1 committed guest.

In undercommit cases (load below threshold/2) we simply return from the PLE
handler. In overcommit cases (load above 1.75 * threshold) we do a yield(). The
rationale is to let other VMs on the host run instead of burning CPU cycles.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
The idea of yielding in overcommit cases (especially with a large number of
small guests) was
Acked-by: Rik van Riel <riel@redhat.com>
Andrew Theurer has also stressed the importance of reducing yield_to
overhead and using yield().

(let threshold = 2048)
Rationale for using threshold/2 for undercommit limit:
 Having a load below (0.5 * threshold) is used to avoid (the concern raised
by Rik) scenarios where we still have a lock-holder-preempted vcpu waiting
to be scheduled. (The scenario arises when the rq length is > 1 even when we
are undercommitted.)

Rationale for using (1.75 * threshold) for the overcommit scenario:
This is a heuristic: above this point we should probably see rq length > 1
with a vcpu of a different VM waiting to be scheduled.

 virt/kvm/kvm_main.c |   35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Peter Zijlstra Oct. 29, 2012, 5:54 p.m. UTC | #1
On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote:
> +/*
> + * A load of 2048 corresponds to 1:1 overcommit
> + * undercommit threshold is half the 1:1 overcommit
> + * overcommit threshold is 1.75 times of 1:1 overcommit threshold
> + */
> +#define COMMIT_THRESHOLD (FIXED_1)
> +#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1)
> +#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) -
> (COMMIT_THRESHOLD >> 2))
> +
> +unsigned long kvm_system_load(void)
> +{
> +       unsigned long load;
> +
> +       load = avenrun[0] + FIXED_1/200;
> +       load = load / num_online_cpus();
> +
> +       return load;
> +} 

ARGH.. no that's wrong.. very wrong.

 1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a
joke.

 2) avenrun[] is a global load, do not ever use a global load measure

 3) avenrun[] has nothing what so ever to do with runqueue lengths,
someone with a gazillion tasks in D state will get a huge load but the
cpu is very idle.


Raghavendra K T Oct. 30, 2012, 5:57 a.m. UTC | #2
On 10/29/2012 11:24 PM, Peter Zijlstra wrote:
> On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote:
>> +/*
>> + * A load of 2048 corresponds to 1:1 overcommit
>> + * undercommit threshold is half the 1:1 overcommit
>> + * overcommit threshold is 1.75 times of 1:1 overcommit threshold
>> + */
>> +#define COMMIT_THRESHOLD (FIXED_1)
>> +#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1)
>> +#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) -
>> (COMMIT_THRESHOLD >> 2))
>> +
>> +unsigned long kvm_system_load(void)
>> +{
>> +       unsigned long load;
>> +
>> +       load = avenrun[0] + FIXED_1/200;
>> +       load = load / num_online_cpus();
>> +
>> +       return load;
>> +}
>
> ARGH.. no that's wrong.. very wrong.
>
>   1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a
> joke.

Okay.

>   2) avenrun[] is a global load, do not ever use a global load measure

This makes sense. Using a local optimization that leads to near global
optimization is the way to go.

>
>   3) avenrun[] has nothing what so ever to do with runqueue lengths,
> someone with a gazillion tasks in D state will get a huge load but the
> cpu is very idle.
>

I used loadavg as an alternative measure, but the above condition
poses a concern for that.

Okay, now IIUC, using *any* global measure is bad?

Because I was also thinking of using nr_running() / num_online_cpus() to
get a sense of global overcommit. (Of course, since this involves
iterating over the per-CPU nr_running counts, I wanted to calculate this
periodically.)

The overall logic of having an overcommit_threshold and an
undercommit_threshold is something I wanted to use even for dynamic
ple_window tuning.

So the logic was:
 < undercommit_threshold => 16k ple_window
 > overcommit_threshold  => 4k ple_window
and for the in-between case, scale the ple_window accordingly.

The alternative was to decide based on how well the PLE handler succeeded
in yield_to. But I thought that is too sensitive and has more overhead.

This topic may deserve a different thread, but I thought I should table it here.

So, thinking about the alternatives to implement, the logic is:

(a) if (undercommitted)
     just go back and spin rather than going through the yield_to iteration.
(b) if (overcommitted)
     better to yield than to use the spinning logic of the current patches.

[ Of course, (a) is already met to a large extent by your patches. ]

So I think everything boils down to:

"how do we measure these two thresholds, without much overhead, in a
compliant way"

Ideas welcome..

Andrew Jones Oct. 30, 2012, 6:34 a.m. UTC | #3
On Tue, Oct 30, 2012 at 11:27:52AM +0530, Raghavendra K T wrote:
> On 10/29/2012 11:24 PM, Peter Zijlstra wrote:
> >On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote:
> >>+/*
> >>+ * A load of 2048 corresponds to 1:1 overcommit
> >>+ * undercommit threshold is half the 1:1 overcommit
> >>+ * overcommit threshold is 1.75 times of 1:1 overcommit threshold
> >>+ */
> >>+#define COMMIT_THRESHOLD (FIXED_1)
> >>+#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1)
> >>+#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) -
> >>(COMMIT_THRESHOLD >> 2))
> >>+
> >>+unsigned long kvm_system_load(void)
> >>+{
> >>+       unsigned long load;
> >>+
> >>+       load = avenrun[0] + FIXED_1/200;
> >>+       load = load / num_online_cpus();
> >>+
> >>+       return load;
> >>+}
> >
> >ARGH.. no that's wrong.. very wrong.
> >
> >  1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a
> >joke.
> 
> Okay.
> 
> >  2) avenrun[] is a global load, do not ever use a global load measure
> 
> This makes sense. Using a local optimization that leads to near global
> optimization is the way to go.
> 
> >
> >  3) avenrun[] has nothing what so ever to do with runqueue lengths,
> >someone with a gazillion tasks in D state will get a huge load but the
> >cpu is very idle.
> >
> 
> I used loadavg as an alternative measure. But the above condition
> poses a concern for that.
> 
> Okay, now IIUC, usage of *any* global measure is bad?
> 
> Because I was also thinking to use nrrunning()/ num_online_cpus(), to
> get an idea of global overcommit sense. (ofcourse since, this involves
> iteration over per CPU nrrunning, I wanted to calculate this
> periodically)
> 
> The overall logic, of having overcommit_threshold,
> undercommit_threshold, I wanted to use for even dynamic ple_window
> tuning purpose.
> 
> so logic was:
> < undercommit_threshold => 16k ple_window
> > overcommit_threshold  => 4k window.
> for in between case scale the ple_window accordingly.
> 
> The alternative was to decide depending on how ple handler succeeded in
> yield_to. But I thought, that is too sensitive and more overhead.
> 
> This topic may deserve different thread, but thought I shall table it here.
> 
> So, Thinking about the alternatives to implement, logic such as
> 
> (a) if(undercommitted)
>     just go back and spin rather than going for yield_to iteration.
> (b) if (overcommitted)
>    better to yield rather than  spinning logic
> 
>    of current patches..
> 
> [ ofcourse, (a) is already met to large extent by your patches..]
> 
> So I think everything boils down to
> 
> "how do we measure these two thresholds without much overhead in a
> compliant way"
> 
> Ideas welcome..
> 

What happened to Avi's preempt notifier idea for determining
under/overcommit? If nobody has picked that up yet, then I'll go ahead and
try to prototype it.

Drew
Raghavendra K T Oct. 30, 2012, 7:31 a.m. UTC | #4
On 10/30/2012 12:04 PM, Andrew Jones wrote:
> On Tue, Oct 30, 2012 at 11:27:52AM +0530, Raghavendra K T wrote:
>> On 10/29/2012 11:24 PM, Peter Zijlstra wrote:
>>> On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote:
>>>> +/*
>>>> + * A load of 2048 corresponds to 1:1 overcommit
>>>> + * undercommit threshold is half the 1:1 overcommit
>>>> + * overcommit threshold is 1.75 times of 1:1 overcommit threshold
>>>> + */
>>>> +#define COMMIT_THRESHOLD (FIXED_1)
>>>> +#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1)
>>>> +#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) -
>>>> (COMMIT_THRESHOLD >> 2))
>>>> +
>>>> +unsigned long kvm_system_load(void)
>>>> +{
>>>> +       unsigned long load;
>>>> +
>>>> +       load = avenrun[0] + FIXED_1/200;
>>>> +       load = load / num_online_cpus();
>>>> +
>>>> +       return load;
>>>> +}
>>>
>>> ARGH.. no that's wrong.. very wrong.
>>>
>>>   1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a
>>> joke.
>>
>> Okay.
>>
>>>   2) avenrun[] is a global load, do not ever use a global load measure
>>
>> This makes sense. Using a local optimization that leads to near global
>> optimization is the way to go.
>>
>>>
>>>   3) avenrun[] has nothing what so ever to do with runqueue lengths,
>>> someone with a gazillion tasks in D state will get a huge load but the
>>> cpu is very idle.
>>>
>>
>> I used loadavg as an alternative measure. But the above condition
>> poses a concern for that.
>>
>> Okay, now IIUC, usage of *any* global measure is bad?
>>
>> Because I was also thinking to use nrrunning()/ num_online_cpus(), to
>> get an idea of global overcommit sense. (ofcourse since, this involves
>> iteration over per CPU nrrunning, I wanted to calculate this
>> periodically)
>>
>> The overall logic, of having overcommit_threshold,
>> undercommit_threshold, I wanted to use for even dynamic ple_window
>> tuning purpose.
>>
>> so logic was:
>> < undercommit_threshold => 16k ple_window
>>> overcommit_threshold  => 4k window.
>> for in between case scale the ple_window accordingly.
>>
>> The alternative was to decide depending on how ple handler succeeded in
>> yield_to. But I thought, that is too sensitive and more overhead.
>>
>> This topic may deserve different thread, but thought I shall table it here.
>>
>> So, Thinking about the alternatives to implement, logic such as
>>
>> (a) if(undercommitted)
>>      just go back and spin rather than going for yield_to iteration.
>> (b) if (overcommitted)
>>     better to yield rather than  spinning logic
>>
>>     of current patches..
>>
>> [ ofcourse, (a) is already met to large extent by your patches..]
>>
>> So I think everything boils down to
>>
>> "how do we measure these two thresholds without much overhead in a
>> compliant way"
>>
>> Ideas welcome..
>>
>
> What happened to Avi's preempt notifier idea for determining
> under/overcommit? If nobody has picked that up yet, then I'll go ahead and
> try to prototype it.

Hi Drew,

I had assumed my priority order as:
1) this patch series, 2) dynamic ple window, 3) preempt notifiers.

But I have no problem re-prioritizing / helping on these, as long as we
are clear on what we are looking into.

I was thinking of the preempt notifier idea as a tool to refine
candidate VCPUs. But you are right; Avi also said we could use the
bitmap/counter itself as an indicator to decide whether to go ahead
with yield_to at all.

IMO, only patch (3) has some conflict because of the various approaches we
can try. Maybe we should attack the problem via all 3 solutions at once and
decide?

To be frank, within each approach, trying/analyzing all the
possibilities has made things slow (at my end).

Suggestions..?

Peter Zijlstra Oct. 30, 2012, 8:14 a.m. UTC | #5
On Tue, 2012-10-30 at 11:27 +0530, Raghavendra K T wrote:
> Okay, now IIUC, usage of *any* global measure is bad?

Yep, people like to carve up their machines, esp. now that they're
somewhat bigger than they used to be. This can result in very asymmetric
loads; no global measure can ever deal with that.


Andrew Jones Oct. 30, 2012, 9:07 a.m. UTC | #6
On Tue, Oct 30, 2012 at 01:01:54PM +0530, Raghavendra K T wrote:
> On 10/30/2012 12:04 PM, Andrew Jones wrote:
> >On Tue, Oct 30, 2012 at 11:27:52AM +0530, Raghavendra K T wrote:
> >>On 10/29/2012 11:24 PM, Peter Zijlstra wrote:
> >>>On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote:
> >>>>+/*
> >>>>+ * A load of 2048 corresponds to 1:1 overcommit
> >>>>+ * undercommit threshold is half the 1:1 overcommit
> >>>>+ * overcommit threshold is 1.75 times of 1:1 overcommit threshold
> >>>>+ */
> >>>>+#define COMMIT_THRESHOLD (FIXED_1)
> >>>>+#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1)
> >>>>+#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) -
> >>>>(COMMIT_THRESHOLD >> 2))
> >>>>+
> >>>>+unsigned long kvm_system_load(void)
> >>>>+{
> >>>>+       unsigned long load;
> >>>>+
> >>>>+       load = avenrun[0] + FIXED_1/200;
> >>>>+       load = load / num_online_cpus();
> >>>>+
> >>>>+       return load;
> >>>>+}
> >>>
> >>>ARGH.. no that's wrong.. very wrong.
> >>>
> >>>  1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a
> >>>joke.
> >>
> >>Okay.
> >>
> >>>  2) avenrun[] is a global load, do not ever use a global load measure
> >>
> >>This makes sense. Using a local optimization that leads to near global
> >>optimization is the way to go.
> >>
> >>>
> >>>  3) avenrun[] has nothing what so ever to do with runqueue lengths,
> >>>someone with a gazillion tasks in D state will get a huge load but the
> >>>cpu is very idle.
> >>>
> >>
> >>I used loadavg as an alternative measure. But the above condition
> >>poses a concern for that.
> >>
> >>Okay, now IIUC, usage of *any* global measure is bad?
> >>
> >>Because I was also thinking to use nrrunning()/ num_online_cpus(), to
> >>get an idea of global overcommit sense. (ofcourse since, this involves
> >>iteration over per CPU nrrunning, I wanted to calculate this
> >>periodically)
> >>
> >>The overall logic, of having overcommit_threshold,
> >>undercommit_threshold, I wanted to use for even dynamic ple_window
> >>tuning purpose.
> >>
> >>so logic was:
> >>< undercommit_threshold => 16k ple_window
> >>>overcommit_threshold  => 4k window.
> >>for in between case scale the ple_window accordingly.
> >>
> >>The alternative was to decide depending on how ple handler succeeded in
> >>yield_to. But I thought, that is too sensitive and more overhead.
> >>
> >>This topic may deserve different thread, but thought I shall table it here.
> >>
> >>So, Thinking about the alternatives to implement, logic such as
> >>
> >>(a) if(undercommitted)
> >>     just go back and spin rather than going for yield_to iteration.
> >>(b) if (overcommitted)
> >>    better to yield rather than  spinning logic
> >>
> >>    of current patches..
> >>
> >>[ ofcourse, (a) is already met to large extent by your patches..]
> >>
> >>So I think everything boils down to
> >>
> >>"how do we measure these two thresholds without much overhead in a
> >>compliant way"
> >>
> >>Ideas welcome..
> >>
> >
> >What happened to Avi's preempt notifier idea for determining
> >under/overcommit? If nobody has picked that up yet, then I'll go ahead and
> >try to prototype it.
> 
> Hi Drew,
> 
> I had assumed my priority order as
> 1) this patch series 2) dynamic ple window 3) preempt notifiers.
> 
> But I do not have any problem on re-prioritizing / helping on these
> as far as we are clear on what we are looking into.
> 
> I was thinking about preempt notifier idea as a tool to refine
> candidate VCPUs. But you are right, Avi, also told we can use
> bitmap/counter itself as an indicator to decide whether we go ahead
> with yield_to at all.
> 
> IMO, only patch(3) has some conflict because of various approach we can
> try.May be we should attack the problem via all 3 solutions at once and
> decide?
> 
> To be frank, within each of the approach, trying/analyzing all the
>  possibilities made the things slow.. (my end).
> 
> Suggestions..?
> 

I agree, it's a complex problem that needs lots of trial+error work. We
should definitely work in parallel on multiple ideas. I'll go ahead and
dig into the preempt notifiers.

Drew
Raghavendra K T Oct. 31, 2012, 6:10 a.m. UTC | #7
On 10/30/2012 01:44 PM, Peter Zijlstra wrote:
> On Tue, 2012-10-30 at 11:27 +0530, Raghavendra K T wrote:
>> Okay, now IIUC, usage of *any* global measure is bad?
>
> Yep, people like to carve up their machines, esp. now that they're
> somewhat bigger than they used to be. This can result in very asymmetric
> loads, no global measure can ever deal with that.

Thanks for explaining the concerns. Very true; if the load is very
asymmetric due to power optimization and similar constraints, this may
affect the measure.


Raghavendra K T Oct. 31, 2012, 12:24 p.m. UTC | #8
On 10/30/2012 02:37 PM, Andrew Jones wrote:
> On Tue, Oct 30, 2012 at 01:01:54PM +0530, Raghavendra K T wrote:
>> On 10/30/2012 12:04 PM, Andrew Jones wrote:
>>> On Tue, Oct 30, 2012 at 11:27:52AM +0530, Raghavendra K T wrote:
>>>> On 10/29/2012 11:24 PM, Peter Zijlstra wrote:
>>>>> On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote:
>>>>>> +/*
>>>>>> + * A load of 2048 corresponds to 1:1 overcommit
>>>>>> + * undercommit threshold is half the 1:1 overcommit
>>>>>> + * overcommit threshold is 1.75 times of 1:1 overcommit threshold
>>>>>> + */
>>>>>> +#define COMMIT_THRESHOLD (FIXED_1)
>>>>>> +#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1)
>>>>>> +#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) -
>>>>>> (COMMIT_THRESHOLD >> 2))
>>>>>> +
>>>>>> +unsigned long kvm_system_load(void)
>>>>>> +{
>>>>>> +       unsigned long load;
>>>>>> +
>>>>>> +       load = avenrun[0] + FIXED_1/200;
>>>>>> +       load = load / num_online_cpus();
>>>>>> +
>>>>>> +       return load;
>>>>>> +}
>>>>>
>>>>> ARGH.. no that's wrong.. very wrong.
>>>>>
>>>>>   1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a
>>>>> joke.
>>>>
>>>> Okay.
>>>>
>>>>>   2) avenrun[] is a global load, do not ever use a global load measure
>>>>
>>>> This makes sense. Using a local optimization that leads to near global
>>>> optimization is the way to go.
>>>>
>>>>>
>>>>>   3) avenrun[] has nothing what so ever to do with runqueue lengths,
>>>>> someone with a gazillion tasks in D state will get a huge load but the
>>>>> cpu is very idle.
>>>>>
>>>>
>>>> I used loadavg as an alternative measure. But the above condition
>>>> poses a concern for that.
>>>>
>>>> Okay, now IIUC, usage of *any* global measure is bad?
>>>>
>>>> Because I was also thinking to use nrrunning()/ num_online_cpus(), to
>>>> get an idea of global overcommit sense. (ofcourse since, this involves
>>>> iteration over per CPU nrrunning, I wanted to calculate this
>>>> periodically)
>>>>
>>>> The overall logic, of having overcommit_threshold,
>>>> undercommit_threshold, I wanted to use for even dynamic ple_window
>>>> tuning purpose.
>>>>
>>>> so logic was:
>>>> < undercommit_threshold => 16k ple_window
>>>>> overcommit_threshold  => 4k window.
>>>> for in between case scale the ple_window accordingly.
>>>>
>>>> The alternative was to decide depending on how ple handler succeeded in
>>>> yield_to. But I thought, that is too sensitive and more overhead.
>>>>
>>>> This topic may deserve different thread, but thought I shall table it here.
>>>>
>>>> So, Thinking about the alternatives to implement, logic such as
>>>>
>>>> (a) if(undercommitted)
>>>>      just go back and spin rather than going for yield_to iteration.
>>>> (b) if (overcommitted)
>>>>     better to yield rather than  spinning logic
>>>>
>>>>     of current patches..
>>>>
>>>> [ ofcourse, (a) is already met to large extent by your patches..]
>>>>
>>>> So I think everything boils down to
>>>>
>>>> "how do we measure these two thresholds without much overhead in a
>>>> compliant way"
>>>>
>>>> Ideas welcome..
>>>>
>>>
>>> What happened to Avi's preempt notifier idea for determining
>>> under/overcommit? If nobody has picked that up yet, then I'll go ahead and
>>> try to prototype it.
>>
>> Hi Drew,
>>
>> I had assumed my priority order as
>> 1) this patch series 2) dynamic ple window 3) preempt notifiers.
>>
>> But I do not have any problem on re-prioritizing / helping on these
>> as far as we are clear on what we are looking into.
>>
>> I was thinking about preempt notifier idea as a tool to refine
>> candidate VCPUs. But you are right, Avi, also told we can use
>> bitmap/counter itself as an indicator to decide whether we go ahead
>> with yield_to at all.
>>
>> IMO, only patch(3) has some conflict because of various approach we can
>> try.May be we should attack the problem via all 3 solutions at once and
>> decide?
>>
>> To be frank, within each of the approach, trying/analyzing all the
>>   possibilities made the things slow.. (my end).
>>
>> Suggestions..?
>>
>
> I agree, it's a complex problem that needs lots of trial+error work. We
> should definitely work in parallel on multiple ideas. I'll go ahead and
> dig into the preempt notifiers.
>

Okay. Thank you. I will concentrate on the dynamic ple window. But I think
the implementation needs some overlapping details from the preempt notifiers.

For the dynamic ple window, to summarize what we thought of doing:

(I assume we have to keep the ple window between 4k and 16k throughout.)

 From preempt notifiers:

(1) From the preempt notifier, check for the overcommit case and, if so,
shrink the ple window.
Question: how do we decide we are overcommitted?
 - Is it the number of preemptions we keep track of vs. the total vcpus?
I think so, but we have to turn that into some formula; then we decrease
the ple window by some factor (until we hit 4k).

(2) How can we say we are undercommitted?
    Perhaps very few vcpus are currently scheduled out; then we set the
ple window closer to the max (16k).

 From yield_to failures:

If yield_to fails with -ESRCH, it potentially indicates undercommit, and
we can again use the logic of increasing the ple window.

Did we miss anything?


Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index e376434..28bbdfb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1697,15 +1697,43 @@  bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
 }
 #endif
 
+/*
+ * A load of 2048 corresponds to 1:1 overcommit
+ * undercommit threshold is half the 1:1 overcommit
+ * overcommit threshold is 1.75 times of 1:1 overcommit threshold
+ */
+#define COMMIT_THRESHOLD (FIXED_1)
+#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1)
+#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) - (COMMIT_THRESHOLD >> 2))
+
+unsigned long kvm_system_load(void)
+{
+	unsigned long load;
+
+	load = avenrun[0] + FIXED_1/200;
+	load = load / num_online_cpus();
+
+	return load;
+}
+
 void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
 	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
 	int yielded = 0;
+	unsigned long load;
 	int pass;
 	int i;
 
+	load = kvm_system_load();
+	/*
+	 * When we are undercommitted let us not waste time in
+	 * iterating over all the VCPUs.
+	 */
+	if (load < UNDERCOMMIT_THRESHOLD)
+		return;
+
 	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
@@ -1735,6 +1763,13 @@  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 				break;
 		}
 	}
+	/*
+	 * If we are not able to yield especially in overcommit cases
+	 * let us be courteous to other VM's VCPUs waiting to be scheduled.
+	 */
+	if (!yielded && load > OVERCOMMIT_THRESHOLD)
+		yield();
+
 	kvm_vcpu_set_in_spin_loop(me, false);
 
 	/* Ensure vcpu is not eligible during next spinloop */