Message ID | 20121029140717.15448.83182.sendpatchset@codeblue (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote: > +/* > + * A load of 2048 corresponds to 1:1 overcommit > + * undercommit threshold is half the 1:1 overcommit > + * overcommit threshold is 1.75 times of 1:1 overcommit threshold > + */ > +#define COMMIT_THRESHOLD (FIXED_1) > +#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1) > +#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) - > (COMMIT_THRESHOLD >> 2)) > + > +unsigned long kvm_system_load(void) > +{ > + unsigned long load; > + > + load = avenrun[0] + FIXED_1/200; > + load = load / num_online_cpus(); > + > + return load; > +} ARGH.. no that's wrong.. very wrong. 1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a joke. 2) avenrun[] is a global load, do not ever use a global load measure 3) avenrun[] has nothing what so ever to do with runqueue lengths, someone with a gazillion tasks in D state will get a huge load but the cpu is very idle. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/29/2012 11:24 PM, Peter Zijlstra wrote: > On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote: >> +/* >> + * A load of 2048 corresponds to 1:1 overcommit >> + * undercommit threshold is half the 1:1 overcommit >> + * overcommit threshold is 1.75 times of 1:1 overcommit threshold >> + */ >> +#define COMMIT_THRESHOLD (FIXED_1) >> +#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1) >> +#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) - >> (COMMIT_THRESHOLD >> 2)) >> + >> +unsigned long kvm_system_load(void) >> +{ >> + unsigned long load; >> + >> + load = avenrun[0] + FIXED_1/200; >> + load = load / num_online_cpus(); >> + >> + return load; >> +} > > ARGH.. no that's wrong.. very wrong. > > 1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a > joke. Okay. > 2) avenrun[] is a global load, do not ever use a global load measure This makes sense. Using a local optimization that leads to near global optimization is the way to go. > > 3) avenrun[] has nothing what so ever to do with runqueue lengths, > someone with a gazillion tasks in D state will get a huge load but the > cpu is very idle. > I used loadavg as an alternative measure. But the above condition poses a concern for that. Okay, now IIUC, usage of *any* global measure is bad? Because I was also thinking to use nrrunning()/ num_online_cpus(), to get an idea of global overcommit sense. (ofcourse since, this involves iteration over per CPU nrrunning, I wanted to calculate this periodically) The overall logic, of having overcommit_threshold, undercommit_threshold, I wanted to use for even dynamic ple_window tuning purpose. so logic was: < undercommit_threshold => 16k ple_window > overcommit_threshold => 4k window. for in between case scale the ple_window accordingly. The alternative was to decide depending on how ple handler succeeded in yield_to. But I thought, that is too sensitive and more overhead. This topic may deserve different thread, but thought I shall table it here. So, Thinking about the alternatives to implement, logic such as (a) if(undercommitted) just go back and spin rather than going for yield_to iteration. (b) if (overcommitted) better to yield rather than spinning logic of current patches.. [ ofcourse, (a) is already met to large extent by your patches..] So I think everything boils down to "how do we measure these two thresholds without much overhead in a compliant way" Ideas welcome.. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Oct 30, 2012 at 11:27:52AM +0530, Raghavendra K T wrote: > On 10/29/2012 11:24 PM, Peter Zijlstra wrote: > >On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote: > >>+/* > >>+ * A load of 2048 corresponds to 1:1 overcommit > >>+ * undercommit threshold is half the 1:1 overcommit > >>+ * overcommit threshold is 1.75 times of 1:1 overcommit threshold > >>+ */ > >>+#define COMMIT_THRESHOLD (FIXED_1) > >>+#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1) > >>+#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) - > >>(COMMIT_THRESHOLD >> 2)) > >>+ > >>+unsigned long kvm_system_load(void) > >>+{ > >>+ unsigned long load; > >>+ > >>+ load = avenrun[0] + FIXED_1/200; > >>+ load = load / num_online_cpus(); > >>+ > >>+ return load; > >>+} > > > >ARGH.. no that's wrong.. very wrong. > > > > 1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a > >joke. > > Okay. > > > 2) avenrun[] is a global load, do not ever use a global load measure > > This makes sense. Using a local optimization that leads to near global > optimization is the way to go. > > > > > 3) avenrun[] has nothing what so ever to do with runqueue lengths, > >someone with a gazillion tasks in D state will get a huge load but the > >cpu is very idle. > > > > I used loadavg as an alternative measure. But the above condition > poses a concern for that. > > Okay, now IIUC, usage of *any* global measure is bad? > > Because I was also thinking to use nrrunning()/ num_online_cpus(), to > get an idea of global overcommit sense. (ofcourse since, this involves > iteration over per CPU nrrunning, I wanted to calculate this > periodically) > > The overall logic, of having overcommit_threshold, > undercommit_threshold, I wanted to use for even dynamic ple_window > tuning purpose. > > so logic was: > < undercommit_threshold => 16k ple_window > > overcommit_threshold => 4k window. > for in between case scale the ple_window accordingly. > > The alternative was to decide depending on how ple handler succeeded in > yield_to. But I thought, that is too sensitive and more overhead. > > This topic may deserve different thread, but thought I shall table it here. > > So, Thinking about the alternatives to implement, logic such as > > (a) if(undercommitted) > just go back and spin rather than going for yield_to iteration. > (b) if (overcommitted) > better to yield rather than spinning logic > > of current patches.. > > [ ofcourse, (a) is already met to large extent by your patches..] > > So I think everything boils down to > > "how do we measure these two thresholds without much overhead in a > compliant way" > > Ideas welcome.. > What happened to Avi's preempt notifier idea for determining under/overcommit? If nobody has picked that up yet, then I'll go ahead and try to prototype it. Drew -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/30/2012 12:04 PM, Andrew Jones wrote: > On Tue, Oct 30, 2012 at 11:27:52AM +0530, Raghavendra K T wrote: >> On 10/29/2012 11:24 PM, Peter Zijlstra wrote: >>> On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote: >>>> +/* >>>> + * A load of 2048 corresponds to 1:1 overcommit >>>> + * undercommit threshold is half the 1:1 overcommit >>>> + * overcommit threshold is 1.75 times of 1:1 overcommit threshold >>>> + */ >>>> +#define COMMIT_THRESHOLD (FIXED_1) >>>> +#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1) >>>> +#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) - >>>> (COMMIT_THRESHOLD >> 2)) >>>> + >>>> +unsigned long kvm_system_load(void) >>>> +{ >>>> + unsigned long load; >>>> + >>>> + load = avenrun[0] + FIXED_1/200; >>>> + load = load / num_online_cpus(); >>>> + >>>> + return load; >>>> +} >>> >>> ARGH.. no that's wrong.. very wrong. >>> >>> 1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a >>> joke. >> >> Okay. >> >>> 2) avenrun[] is a global load, do not ever use a global load measure >> >> This makes sense. Using a local optimization that leads to near global >> optimization is the way to go. >> >>> >>> 3) avenrun[] has nothing what so ever to do with runqueue lengths, >>> someone with a gazillion tasks in D state will get a huge load but the >>> cpu is very idle. >>> >> >> I used loadavg as an alternative measure. But the above condition >> poses a concern for that. >> >> Okay, now IIUC, usage of *any* global measure is bad? >> >> Because I was also thinking to use nrrunning()/ num_online_cpus(), to >> get an idea of global overcommit sense. (ofcourse since, this involves >> iteration over per CPU nrrunning, I wanted to calculate this >> periodically) >> >> The overall logic, of having overcommit_threshold, >> undercommit_threshold, I wanted to use for even dynamic ple_window >> tuning purpose. >> >> so logic was: >> < undercommit_threshold => 16k ple_window >>> overcommit_threshold => 4k window. >> for in between case scale the ple_window accordingly. >> >> The alternative was to decide depending on how ple handler succeeded in >> yield_to. But I thought, that is too sensitive and more overhead. >> >> This topic may deserve different thread, but thought I shall table it here. >> >> So, Thinking about the alternatives to implement, logic such as >> >> (a) if(undercommitted) >> just go back and spin rather than going for yield_to iteration. >> (b) if (overcommitted) >> better to yield rather than spinning logic >> >> of current patches.. >> >> [ ofcourse, (a) is already met to large extent by your patches..] >> >> So I think everything boils down to >> >> "how do we measure these two thresholds without much overhead in a >> compliant way" >> >> Ideas welcome.. >> > > What happened to Avi's preempt notifier idea for determining > under/overcommit? If nobody has picked that up yet, then I'll go ahead and > try to prototype it. Hi Drew, I had assumed my priority order as 1) this patch series 2) dynamic ple window 3) preempt notifiers. But I do not have any problem on re-prioritizing / helping on these as far as we are clear on what we are looking into. I was thinking about preempt notifier idea as a tool to refine candidate VCPUs. But you are right, Avi, also told we can use bitmap/counter itself as an indicator to decide whether we go ahead with yield_to at all. IMO, only patch(3) has some conflict because of various approach we can try.May be we should attack the problem via all 3 solutions at once and decide? To be frank, within each of the approach, trying/analyzing all the possibilities made the things slow.. (my end). Suggestions..? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2012-10-30 at 11:27 +0530, Raghavendra K T wrote:
> Okay, now IIUC, usage of *any* global measure is bad?
Yep, people like to carve up their machines, esp. now that they're
somewhat bigger than they used to be. This can result in very asymmetric
loads, no global measure can ever deal with that.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Oct 30, 2012 at 01:01:54PM +0530, Raghavendra K T wrote: > On 10/30/2012 12:04 PM, Andrew Jones wrote: > >On Tue, Oct 30, 2012 at 11:27:52AM +0530, Raghavendra K T wrote: > >>On 10/29/2012 11:24 PM, Peter Zijlstra wrote: > >>>On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote: > >>>>+/* > >>>>+ * A load of 2048 corresponds to 1:1 overcommit > >>>>+ * undercommit threshold is half the 1:1 overcommit > >>>>+ * overcommit threshold is 1.75 times of 1:1 overcommit threshold > >>>>+ */ > >>>>+#define COMMIT_THRESHOLD (FIXED_1) > >>>>+#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1) > >>>>+#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) - > >>>>(COMMIT_THRESHOLD >> 2)) > >>>>+ > >>>>+unsigned long kvm_system_load(void) > >>>>+{ > >>>>+ unsigned long load; > >>>>+ > >>>>+ load = avenrun[0] + FIXED_1/200; > >>>>+ load = load / num_online_cpus(); > >>>>+ > >>>>+ return load; > >>>>+} > >>> > >>>ARGH.. no that's wrong.. very wrong. > >>> > >>> 1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a > >>>joke. > >> > >>Okay. > >> > >>> 2) avenrun[] is a global load, do not ever use a global load measure > >> > >>This makes sense. Using a local optimization that leads to near global > >>optimization is the way to go. > >> > >>> > >>> 3) avenrun[] has nothing what so ever to do with runqueue lengths, > >>>someone with a gazillion tasks in D state will get a huge load but the > >>>cpu is very idle. > >>> > >> > >>I used loadavg as an alternative measure. But the above condition > >>poses a concern for that. > >> > >>Okay, now IIUC, usage of *any* global measure is bad? > >> > >>Because I was also thinking to use nrrunning()/ num_online_cpus(), to > >>get an idea of global overcommit sense. (ofcourse since, this involves > >>iteration over per CPU nrrunning, I wanted to calculate this > >>periodically) > >> > >>The overall logic, of having overcommit_threshold, > >>undercommit_threshold, I wanted to use for even dynamic ple_window > >>tuning purpose. > >> > >>so logic was: > >>< undercommit_threshold => 16k ple_window > >>>overcommit_threshold => 4k window. > >>for in between case scale the ple_window accordingly. > >> > >>The alternative was to decide depending on how ple handler succeeded in > >>yield_to. But I thought, that is too sensitive and more overhead. > >> > >>This topic may deserve different thread, but thought I shall table it here. > >> > >>So, Thinking about the alternatives to implement, logic such as > >> > >>(a) if(undercommitted) > >> just go back and spin rather than going for yield_to iteration. > >>(b) if (overcommitted) > >> better to yield rather than spinning logic > >> > >> of current patches.. > >> > >>[ ofcourse, (a) is already met to large extent by your patches..] > >> > >>So I think everything boils down to > >> > >>"how do we measure these two thresholds without much overhead in a > >>compliant way" > >> > >>Ideas welcome.. > >> > > > >What happened to Avi's preempt notifier idea for determining > >under/overcommit? If nobody has picked that up yet, then I'll go ahead and > >try to prototype it. > > Hi Drew, > > I had assumed my priority order as > 1) this patch series 2) dynamic ple window 3) preempt notifiers. > > But I do not have any problem on re-prioritizing / helping on these > as far as we are clear on what we are looking into. > > I was thinking about preempt notifier idea as a tool to refine > candidate VCPUs. But you are right, Avi, also told we can use > bitmap/counter itself as an indicator to decide whether we go ahead > with yield_to at all. > > IMO, only patch(3) has some conflict because of various approach we can > try.May be we should attack the problem via all 3 solutions at once and > decide? > > To be frank, within each of the approach, trying/analyzing all the > possibilities made the things slow.. (my end). > > Suggestions..? > I agree, it's a complex problem that needs lots of trial+error work. We should definitely work in parallel on multiple ideas. I'll go ahead and dig into the preempt notifiers. Drew -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/30/2012 01:44 PM, Peter Zijlstra wrote: > On Tue, 2012-10-30 at 11:27 +0530, Raghavendra K T wrote: >> Okay, now IIUC, usage of *any* global measure is bad? > > Yep, people like to carve up their machines, esp. now that they're > somewhat bigger than they used to be. This can result in very asymmetric > loads, no global measure can ever deal with that. Thanks for explaining the concerns. Very True and if load is very asymmetric due to power optimization etc constraints. This may affect. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/30/2012 02:37 PM, Andrew Jones wrote: > On Tue, Oct 30, 2012 at 01:01:54PM +0530, Raghavendra K T wrote: >> On 10/30/2012 12:04 PM, Andrew Jones wrote: >>> On Tue, Oct 30, 2012 at 11:27:52AM +0530, Raghavendra K T wrote: >>>> On 10/29/2012 11:24 PM, Peter Zijlstra wrote: >>>>> On Mon, 2012-10-29 at 19:37 +0530, Raghavendra K T wrote: >>>>>> +/* >>>>>> + * A load of 2048 corresponds to 1:1 overcommit >>>>>> + * undercommit threshold is half the 1:1 overcommit >>>>>> + * overcommit threshold is 1.75 times of 1:1 overcommit threshold >>>>>> + */ >>>>>> +#define COMMIT_THRESHOLD (FIXED_1) >>>>>> +#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1) >>>>>> +#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) - >>>>>> (COMMIT_THRESHOLD >> 2)) >>>>>> + >>>>>> +unsigned long kvm_system_load(void) >>>>>> +{ >>>>>> + unsigned long load; >>>>>> + >>>>>> + load = avenrun[0] + FIXED_1/200; >>>>>> + load = load / num_online_cpus(); >>>>>> + >>>>>> + return load; >>>>>> +} >>>>> >>>>> ARGH.. no that's wrong.. very wrong. >>>>> >>>>> 1) avenrun[] EXPORT_SYMBOL says it should be removed, that's not a >>>>> joke. >>>> >>>> Okay. >>>> >>>>> 2) avenrun[] is a global load, do not ever use a global load measure >>>> >>>> This makes sense. Using a local optimization that leads to near global >>>> optimization is the way to go. >>>> >>>>> >>>>> 3) avenrun[] has nothing what so ever to do with runqueue lengths, >>>>> someone with a gazillion tasks in D state will get a huge load but the >>>>> cpu is very idle. >>>>> >>>> >>>> I used loadavg as an alternative measure. But the above condition >>>> poses a concern for that. >>>> >>>> Okay, now IIUC, usage of *any* global measure is bad? >>>> >>>> Because I was also thinking to use nrrunning()/ num_online_cpus(), to >>>> get an idea of global overcommit sense. (ofcourse since, this involves >>>> iteration over per CPU nrrunning, I wanted to calculate this >>>> periodically) >>>> >>>> The overall logic, of having overcommit_threshold, >>>> undercommit_threshold, I wanted to use for even dynamic ple_window >>>> tuning purpose. >>>> >>>> so logic was: >>>> < undercommit_threshold => 16k ple_window >>>>> overcommit_threshold => 4k window. >>>> for in between case scale the ple_window accordingly. >>>> >>>> The alternative was to decide depending on how ple handler succeeded in >>>> yield_to. But I thought, that is too sensitive and more overhead. >>>> >>>> This topic may deserve different thread, but thought I shall table it here. >>>> >>>> So, Thinking about the alternatives to implement, logic such as >>>> >>>> (a) if(undercommitted) >>>> just go back and spin rather than going for yield_to iteration. >>>> (b) if (overcommitted) >>>> better to yield rather than spinning logic >>>> >>>> of current patches.. >>>> >>>> [ ofcourse, (a) is already met to large extent by your patches..] >>>> >>>> So I think everything boils down to >>>> >>>> "how do we measure these two thresholds without much overhead in a >>>> compliant way" >>>> >>>> Ideas welcome.. >>>> >>> >>> What happened to Avi's preempt notifier idea for determining >>> under/overcommit? If nobody has picked that up yet, then I'll go ahead and >>> try to prototype it. >> >> Hi Drew, >> >> I had assumed my priority order as >> 1) this patch series 2) dynamic ple window 3) preempt notifiers. >> >> But I do not have any problem on re-prioritizing / helping on these >> as far as we are clear on what we are looking into. >> >> I was thinking about preempt notifier idea as a tool to refine >> candidate VCPUs. But you are right, Avi, also told we can use >> bitmap/counter itself as an indicator to decide whether we go ahead >> with yield_to at all. >> >> IMO, only patch(3) has some conflict because of various approach we can >> try.May be we should attack the problem via all 3 solutions at once and >> decide? >> >> To be frank, within each of the approach, trying/analyzing all the >> possibilities made the things slow.. (my end). >> >> Suggestions..? >> > > I agree, it's a complex problem that needs lots of trial+error work. We > should definitely work in parallel on multiple ideas. I'll go ahead and > dig into the preempt notifiers. > Okay. Thank you. I will concentrate on dynamic_ple window.. But I think implementation need some overlapping details from preempt notifier. For dynamic ple window, To summarize, what we thought of doing, ( I hope we have to keep the ple window between 4k - 16k throughout) From preempt notifiers: (1) from the preempt notifier check the overcommit case, if so increase the ple window questions: How do we say we are overcommitted? - is it number of preemption we keep track vs total vcpus. I think so. But we have to convert into some formula.. we shall decrease the ple window by some factor (unless we hit 4k) (2) How can say we are undercommitted: Perhaps there is very less number of vcpus that are scheduled out currently. we tend to set ple window closer to max (16k). From yield_to failures: if yield_to fails with ESRCH, it potentially indicate undercommit and we can again use logic of increasing ple window. Did we miss anything? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e376434..28bbdfb 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1697,15 +1697,43 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu) } #endif +/* + * A load of 2048 corresponds to 1:1 overcommit + * undercommit threshold is half the 1:1 overcommit + * overcommit threshold is 1.75 times of 1:1 overcommit threshold + */ +#define COMMIT_THRESHOLD (FIXED_1) +#define UNDERCOMMIT_THRESHOLD (COMMIT_THRESHOLD >> 1) +#define OVERCOMMIT_THRESHOLD ((COMMIT_THRESHOLD << 1) - (COMMIT_THRESHOLD >> 2)) + +unsigned long kvm_system_load(void) +{ + unsigned long load; + + load = avenrun[0] + FIXED_1/200; + load = load / num_online_cpus(); + + return load; +} + void kvm_vcpu_on_spin(struct kvm_vcpu *me) { struct kvm *kvm = me->kvm; struct kvm_vcpu *vcpu; int last_boosted_vcpu = me->kvm->last_boosted_vcpu; int yielded = 0; + unsigned long load; int pass; int i; + load = kvm_system_load(); + /* + * When we are undercomitted let us not waste time in + * iterating over all the VCPUs. + */ + if (load < UNDERCOMMIT_THRESHOLD) + return; + kvm_vcpu_set_in_spin_loop(me, true); /* * We boost the priority of a VCPU that is runnable but not @@ -1735,6 +1763,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me) break; } } + /* + * If we are not able to yield especially in overcommit cases + * let us be courteous to other VM's VCPUs waiting to be scheduled. + */ + if (!yielded && load > OVERCOMMIT_THRESHOLD) + yield(); + kvm_vcpu_set_in_spin_loop(me, false); /* Ensure vcpu is not eligible during next spinloop */