Message ID | 20120921120019.27611.66093.sendpatchset@codeblue (mailing list archive)
---|---
State | New, archived
On 09/21/2012 08:00 AM, Raghavendra K T wrote:
> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> When PLE handler fails to find a better candidate to yield_to, it
> goes back and does spin again. This is acceptable when we do not
> have overcommit.
> But in overcommitted scenarios (especially when we have large
> number of small guests), it is better to yield.
>
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Acked-by: Rik van Riel <riel@redhat.com>

On Fri, 21 Sep 2012 17:30:20 +0530
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> When PLE handler fails to find a better candidate to yield_to, it
> goes back and does spin again. This is acceptable when we do not
> have overcommit.
> But in overcommitted scenarios (especially when we have large
> number of small guests), it is better to yield.
>
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  virt/kvm/kvm_main.c | 4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8323685..713b677 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  			}
>  		}
>  	}
> +	/* In overcommitted cases, yield instead of spinning */
> +	if (!yielded && rq_nr_running() > 1)
> +		schedule();

How about doing cond_resched() instead?

I'm not sure whether checking more sched stuff in KVM code is a good
thing.

	Takuya

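Not part of the posted patch, but for comparison, a minimal sketch of what Takuya's cond_resched() alternative could look like in the same spot (cond_resched() only switches tasks if the scheduler has already flagged this task for preemption):

```c
	/*
	 * Hypothetical variant: let the scheduler run something else,
	 * but only if it already wants to preempt us (need_resched()
	 * is set); otherwise this is close to a no-op.
	 */
	if (!yielded)
		cond_resched();
```
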
On 09/21/2012 09:46 AM, Takuya Yoshikawa wrote:
> On Fri, 21 Sep 2012 17:30:20 +0530
> Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>
>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>
>> When PLE handler fails to find a better candidate to yield_to, it
>> goes back and does spin again. This is acceptable when we do not
>> have overcommit.
>> But in overcommitted scenarios (especially when we have large
>> number of small guests), it is better to yield.
>>
>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>  virt/kvm/kvm_main.c | 4 ++++
>>  1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 8323685..713b677 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>  			}
>>  		}
>>  	}
>> +	/* In overcommitted cases, yield instead of spinning */
>> +	if (!yielded && rq_nr_running() > 1)
>> +		schedule();
>
> How about doing cond_resched() instead?

Actually, an actual call to yield() may be better.

That will set scheduler hints to make the scheduler pick
another task for one round, while preserving this task's
top position in the runqueue.

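Again only a sketch, not the posted patch: the same hunk reworked along the lines Rik suggests, using the stock yield() primitive (which marks the current task as the CFS skip buddy) together with the rq_nr_running() helper introduced earlier in this series:

```c
	/* In overcommitted cases, yield instead of spinning */
	if (!yielded && rq_nr_running() > 1)
		yield();	/* skip this task for one round, keep its runqueue position */
```
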
On 09/21/2012 07:22 PM, Rik van Riel wrote:
> On 09/21/2012 09:46 AM, Takuya Yoshikawa wrote:
>> On Fri, 21 Sep 2012 17:30:20 +0530
>> Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>
>>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>
>>> When PLE handler fails to find a better candidate to yield_to, it
>>> goes back and does spin again. This is acceptable when we do not
>>> have overcommit.
>>> But in overcommitted scenarios (especially when we have large
>>> number of small guests), it is better to yield.
>>>
>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>> ---
>>>  virt/kvm/kvm_main.c | 4 ++++
>>>  1 files changed, 4 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index 8323685..713b677 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>>  			}
>>>  		}
>>>  	}
>>> +	/* In overcommitted cases, yield instead of spinning */
>>> +	if (!yielded && rq_nr_running() > 1)
>>> +		schedule();
>>
>> How about doing cond_resched() instead?
>
> Actually, an actual call to yield() may be better.
>
> That will set scheduler hints to make the scheduler pick
> another task for one round, while preserving this task's
> top position in the runqueue.

I am not a scheduler expert, but I am also inclined towards
Rik's suggestion here since we set skip buddy here. Takuya?

On Fri, 21 Sep 2012 23:15:40 +0530
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> >> How about doing cond_resched() instead?
> >
> > Actually, an actual call to yield() may be better.
> >
> > That will set scheduler hints to make the scheduler pick
> > another task for one round, while preserving this task's
> > top position in the runqueue.
>
> I am not a scheduler expert, but I am also inclined towards
> Rik's suggestion here since we set skip buddy here. Takuya?

Yes, I think it's better.

But I hope that experts in Cc will suggest the best way.

Thanks,
	Takuya

On 09/21/2012 03:00 PM, Raghavendra K T wrote:
> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> When PLE handler fails to find a better candidate to yield_to, it
> goes back and does spin again. This is acceptable when we do not
> have overcommit.
> But in overcommitted scenarios (especially when we have large
> number of small guests), it is better to yield.
>
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  virt/kvm/kvm_main.c | 4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8323685..713b677 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  			}
>  		}
>  	}
> +	/* In overcommitted cases, yield instead of spinning */
> +	if (!yielded && rq_nr_running() > 1)
> +		schedule();
> +

I think this is a no-op these (CFS) days.  To get schedule() to do
anything, you need to wake up a task, or let time pass, or block.
Otherwise it will see that nothing has changed and as far as it's
concerned you're still the best task to be running (otherwise it
wouldn't have picked you in the first place).

On Mon, 2012-09-24 at 17:26 +0200, Avi Kivity wrote:
> I think this is a no-op these (CFS) days.  To get schedule() to do
> anything, you need to wake up a task, or let time pass, or block.
> Otherwise it will see that nothing has changed and as far as it's
> concerned you're still the best task to be running (otherwise it
> wouldn't have picked you in the first place).

Time could have passed enough before calling this that there's now a
different/more eligible task around to schedule.

Esp. for a !PREEMPT kernel this could be significant.

On 09/24/2012 05:34 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:26 +0200, Avi Kivity wrote:
>> I think this is a no-op these (CFS) days.  To get schedule() to do
>> anything, you need to wake up a task, or let time pass, or block.
>> Otherwise it will see that nothing has changed and as far as it's
>> concerned you're still the best task to be running (otherwise it
>> wouldn't have picked you in the first place).
>
> Time could have passed enough before calling this that there's now a
> different/more eligible task around to schedule.

Wouldn't this correspond to the scheduler interrupt firing and causing a
reschedule?  I thought the timer was programmed for exactly the point in
time that CFS considers the right time for a switch.  But I'm basing
this on my mental model of CFS, not CFS itself.

> Esp. for a !PREEMPT kernel this could be significant.

On Mon, 2012-09-24 at 17:43 +0200, Avi Kivity wrote:
> Wouldn't this correspond to the scheduler interrupt firing and causing a
> reschedule?  I thought the timer was programmed for exactly the point in
> time that CFS considers the right time for a switch.  But I'm basing
> this on my mental model of CFS, not CFS itself.

No, we tried this for hrtimer kernels for a while, but programming
hrtimers the whole time (every actual task-switch) turns out to be far
too expensive. So we're back to HZ ticks and 'polling' the preemption
state.

Even if we remove all the hrtimer infrastructure overhead (can do with a
few hacks) setting the hardware requires going out to the LAPIC, which
is stupid slow.

Some hardware actually has fast/reliable/usable timers, sadly none of it
is popular.

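A rough illustration of the 'polling' model Peter describes (hypothetical helper, not actual scheduler code): the HZ tick merely sets TIF_NEED_RESCHED, and the actual switch happens at the next point that checks it, e.g. a cond_resched() call or the return to userspace.

```c
/*
 * Hypothetical sketch of a voluntary preemption point on a !PREEMPT
 * kernel: the periodic tick (or a wakeup) sets TIF_NEED_RESCHED, and
 * nothing is rescheduled until code like this polls the flag.
 */
static inline void preemption_point(void)
{
	if (need_resched())
		schedule();
}
```
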
On 09/24/2012 05:52 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:43 +0200, Avi Kivity wrote:
>> Wouldn't this correspond to the scheduler interrupt firing and causing a
>> reschedule?  I thought the timer was programmed for exactly the point in
>> time that CFS considers the right time for a switch.  But I'm basing
>> this on my mental model of CFS, not CFS itself.
>
> No, we tried this for hrtimer kernels for a while, but programming
> hrtimers the whole time (every actual task-switch) turns out to be far
> too expensive. So we're back to HZ ticks and 'polling' the preemption
> state.

Ok, so I wasn't completely off base.  With HZ=1000, an explicit call can
only beat the tick-driven schedule() by at most a millisecond, and we
need to be a lot faster.

> Even if we remove all the hrtimer infrastructure overhead (can do with a
> few hacks) setting the hardware requires going out to the LAPIC, which
> is stupid slow.
>
> Some hardware actually has fast/reliable/usable timers, sadly none of it
> is popular.

There is the TSC deadline timer mode of newer Intels.  Programming the
timer is a simple wrmsr, and it will fire immediately if it already
expired.  Unfortunately on AMDs it is not available, and on virtual
hardware it will be slow (~1-2 usec).

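A sketch of what "a simple wrmsr" means here, assuming a kernel where rdtsc() and wrmsrl() are available and CPUID has advertised the TSC-deadline feature; the function name is hypothetical, and the real arming path lives in the LAPIC clockevent code:

```c
#include <linux/types.h>
#include <asm/msr.h>

/*
 * Arm the local APIC TSC-deadline timer 'delta' TSC cycles from now.
 * A deadline that is already in the past makes it fire immediately.
 */
static void arm_tsc_deadline(u64 delta)
{
	wrmsrl(MSR_IA32_TSC_DEADLINE, rdtsc() + delta);
}
```
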
On Mon, 2012-09-24 at 17:58 +0200, Avi Kivity wrote:
> There is the TSC deadline timer mode of newer Intels.  Programming the
> timer is a simple wrmsr, and it will fire immediately if it already
> expired.  Unfortunately on AMDs it is not available, and on virtual
> hardware it will be slow (~1-2 usec).

Its also still a LAPIC write -- disguised as an MSR though :/

Also, who gives a hoot about virtual crap ;-)

On 09/24/2012 06:05 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:58 +0200, Avi Kivity wrote:
>> There is the TSC deadline timer mode of newer Intels.  Programming the
>> timer is a simple wrmsr, and it will fire immediately if it already
>> expired.  Unfortunately on AMDs it is not available, and on virtual
>> hardware it will be slow (~1-2 usec).
>
> Its also still a LAPIC write -- disguised as an MSR though :/

It's probably a whole lot faster though.

> Also, who gives a hoot about virtual crap ;-)

I only mentioned it to see if your virtual crap detector is still
working.  Looks like it's still in top condition, low latency and 100%
hit rate.

On Mon, 2012-09-24 at 18:10 +0200, Avi Kivity wrote:
> > Its also still a LAPIC write -- disguised as an MSR though :/
>
> It's probably a whole lot faster though.

I've been told its not, I haven't tried it.

On 09/24/2012 06:13 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:10 +0200, Avi Kivity wrote:
>> > Its also still a LAPIC write -- disguised as an MSR though :/
>>
>> It's probably a whole lot faster though.
>
> I've been told its not, I haven't tried it.

I'll see if I can find a machine with it (don't see it on my Westmere,
it's probably on one of the Bridges).  Or maybe the other Peter knows.

On 09/24/2012 06:21 PM, Avi Kivity wrote:
> On 09/24/2012 06:13 PM, Peter Zijlstra wrote:
> > On Mon, 2012-09-24 at 18:10 +0200, Avi Kivity wrote:
> >> > Its also still a LAPIC write -- disguised as an MSR though :/
> >>
> >> It's probably a whole lot faster though.
> >
> > I've been told its not, I haven't tried it.
>
> I'll see if I can find a machine with it (don't see it on my Westmere,
> it's probably on one of the Bridges).  Or maybe the other Peter knows.

Before measuring TSC_DEADLINE, I measured writing to TMICT, it costs 32
cycles.  This is on a Sandy Bridge.

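For context, roughly the kind of microbenchmark that number comes from (a hypothetical sketch only; a careful measurement would add serializing instructions around the rdtsc()s and average over many iterations):

```c
#include <linux/types.h>
#include <asm/apic.h>
#include <asm/msr.h>

/*
 * Estimate the TSC-cycle cost of one write to the local APIC
 * initial-count register (TMICT).
 */
static u64 time_tmict_write(void)
{
	u64 start = rdtsc();

	apic_write(APIC_TMICT, 0xffffffff);	/* program the one-shot count */
	return rdtsc() - start;
}
```
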
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8323685..713b677 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 			}
 		}
 	}
+	/* In overcommitted cases, yield instead of spinning */
+	if (!yielded && rq_nr_running() > 1)
+		schedule();
+
 	kvm_vcpu_set_in_spin_loop(me, false);
 
 	/* Ensure vcpu is not eligible during next spinloop */