
[RFC,2/2] kvm: Be courteous to other VMs in overcommitted scenario in PLE handler

Message ID 20120921120019.27611.66093.sendpatchset@codeblue (mailing list archive)
State New, archived

Commit Message

Raghavendra K T Sept. 21, 2012, noon UTC
From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

When the PLE handler fails to find a better candidate to yield_to, it
goes back and spins again. This is acceptable when we do not have
overcommit.
But in overcommitted scenarios (especially when we have a large
number of small guests), it is better to yield.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 virt/kvm/kvm_main.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)



Comments

Rik van Riel Sept. 21, 2012, 1:22 p.m. UTC | #1
On 09/21/2012 08:00 AM, Raghavendra K T wrote:
> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>
> When PLE handler fails to find a better candidate to yield_to, it
> goes back and does spin again. This is acceptable when we do not
> have overcommit.
> But in overcommitted scenarios (especially when we have large
> number of small guests), it is better to yield.
>
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>

Acked-by: Rik van Riel <riel@redhat.com>
Takuya Yoshikawa Sept. 21, 2012, 1:46 p.m. UTC | #2
On Fri, 21 Sep 2012 17:30:20 +0530
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> 
> When PLE handler fails to find a better candidate to yield_to, it
> goes back and does spin again. This is acceptable when we do not
> have overcommit.
> But in overcommitted scenarios (especially when we have large
> number of small guests), it is better to yield.
> 
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  virt/kvm/kvm_main.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8323685..713b677 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  			}
>  		}
>  	}
> +	/* In overcommitted cases, yield instead of spinning */
> +	if (!yielded && rq_nr_running() > 1)
> +		schedule();

How about doing cond_resched() instead?

I'm not sure whether checking more sched stuff in KVM code is a
good thing.

	Takuya
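
A rough sketch of the cond_resched() variant being suggested, in place of
the schedule() call in the posted hunk (same surrounding code in
kvm_vcpu_on_spin(); since cond_resched() only switches when need_resched()
is already set, the runqueue-length check arguably becomes unnecessary):

	/* In overcommitted cases, drop out of the spin loop if the
	 * scheduler already wants to run something else; cond_resched()
	 * is a no-op unless need_resched() is set for this CPU. */
	if (!yielded)
		cond_resched();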
Rik van Riel Sept. 21, 2012, 1:52 p.m. UTC | #3
On 09/21/2012 09:46 AM, Takuya Yoshikawa wrote:
> On Fri, 21 Sep 2012 17:30:20 +0530
> Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>
>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>
>> When PLE handler fails to find a better candidate to yield_to, it
>> goes back and does spin again. This is acceptable when we do not
>> have overcommit.
>> But in overcommitted scenarios (especially when we have large
>> number of small guests), it is better to yield.
>>
>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>> ---
>>   virt/kvm/kvm_main.c |    4 ++++
>>   1 files changed, 4 insertions(+), 0 deletions(-)
>>
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index 8323685..713b677 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>   			}
>>   		}
>>   	}
>> +	/* In overcommitted cases, yield instead of spinning */
>> +	if (!yielded && rq_nr_running() > 1)
>> +		schedule();
>
> How about doing cond_resched() instead?

Actually, an actual call to yield() may be better.

That will set scheduler hints to make the scheduler pick
another task for one round, while preserving this task's
top position in the runqueue.
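
A minimal sketch of the yield()-based variant Rik describes, in the same
spot as the posted hunk (keeping the rq_nr_running() check from the RFC):

	/* In overcommitted cases, give up the CPU instead of spinning.
	 * yield() marks this task as the CFS skip buddy, so the scheduler
	 * prefers another runnable task for one round while this task
	 * keeps its position in the runqueue. */
	if (!yielded && rq_nr_running() > 1)
		yield();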
Raghavendra K T Sept. 21, 2012, 5:45 p.m. UTC | #4
On 09/21/2012 07:22 PM, Rik van Riel wrote:
> On 09/21/2012 09:46 AM, Takuya Yoshikawa wrote:
>> On Fri, 21 Sep 2012 17:30:20 +0530
>> Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:
>>
>>> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>>
>>> When PLE handler fails to find a better candidate to yield_to, it
>>> goes back and does spin again. This is acceptable when we do not
>>> have overcommit.
>>> But in overcommitted scenarios (especially when we have large
>>> number of small guests), it is better to yield.
>>>
>>> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>>> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
>>> ---
>>> virt/kvm/kvm_main.c | 4 ++++
>>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index 8323685..713b677 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>> }
>>> }
>>> }
>>> + /* In overcommitted cases, yield instead of spinning */
>>> + if (!yielded && rq_nr_running() > 1)
>>> + schedule();
>>
>> How about doing cond_resched() instead?
>
> Actually, an actual call to yield() may be better.
>
> That will set scheduler hints to make the scheduler pick
> another task for one round, while preserving this task's
> top position in the runqueue.

I am not a scheduler expert, but I am also inclined towards
Rik's suggestion here since we set skip buddy here. Takuya?

Takuya Yoshikawa Sept. 24, 2012, 1:43 p.m. UTC | #5
On Fri, 21 Sep 2012 23:15:40 +0530
Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> wrote:

> >> How about doing cond_resched() instead?
> >
> > Actually, an actual call to yield() may be better.
> >
> > That will set scheduler hints to make the scheduler pick
> > another task for one round, while preserving this task's
> > top position in the runqueue.
> 
> I am not a scheduler expert, but I am also inclined towards
> Rik's suggestion here since we set skip buddy here. Takuya?
> 

Yes, I think it's better.
But I hope that experts in Cc will suggest the best way.

Thanks,
	Takuya
Avi Kivity Sept. 24, 2012, 3:26 p.m. UTC | #6
On 09/21/2012 03:00 PM, Raghavendra K T wrote:
> From: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> 
> When PLE handler fails to find a better candidate to yield_to, it
> goes back and does spin again. This is acceptable when we do not
> have overcommit.
> But in overcommitted scenarios (especially when we have large
> number of small guests), it is better to yield.
> 
> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  virt/kvm/kvm_main.c |    4 ++++
>  1 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8323685..713b677 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1660,6 +1660,10 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  			}
>  		}
>  	}
> +	/* In overcommitted cases, yield instead of spinning */
> +	if (!yielded && rq_nr_running() > 1)
> +		schedule();
> +

I think this is a no-op these (CFS) days.  To get schedule() to do
anything, you need to wake up a task, or let time pass, or block.
Otherwise it will see that nothing has changed and as far as it's
concerned you're still the best task to be running (otherwise it
wouldn't have picked you in the first place).
Peter Zijlstra Sept. 24, 2012, 3:34 p.m. UTC | #7
On Mon, 2012-09-24 at 17:26 +0200, Avi Kivity wrote:
> I think this is a no-op these (CFS) days.  To get schedule() to do
> anything, you need to wake up a task, or let time pass, or block.
> Otherwise it will see that nothing has changed and as far as it's
> concerned you're still the best task to be running (otherwise it
> wouldn't have picked you in the first place). 

Time could have passed enough before calling this that there's now a
different/more eligible task around to schedule.

Esp. for a !PREEMPT kernel this could be significant.
Avi Kivity Sept. 24, 2012, 3:43 p.m. UTC | #8
On 09/24/2012 05:34 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:26 +0200, Avi Kivity wrote:
>> I think this is a no-op these (CFS) days.  To get schedule() to do
>> anything, you need to wake up a task, or let time pass, or block.
>> Otherwise it will see that nothing has changed and as far as it's
>> concerned you're still the best task to be running (otherwise it
>> wouldn't have picked you in the first place). 
> 
> Time could have passed enough before calling this that there's now a
> different/more eligible task around to schedule.

Wouldn't this correspond to the scheduler interrupt firing and causing a
reschedule?  I thought the timer was programmed for exactly the point in
time that CFS considers the right time for a switch.  But I'm basing
this on my mental model of CFS, not CFS itself.

> Esp. for a !PREEMPT kernel this is could be significant.
Peter Zijlstra Sept. 24, 2012, 3:52 p.m. UTC | #9
On Mon, 2012-09-24 at 17:43 +0200, Avi Kivity wrote:
> Wouldn't this correspond to the scheduler interrupt firing and causing a
> reschedule?  I thought the timer was programmed for exactly the point in
> time that CFS considers the right time for a switch.  But I'm basing
> this on my mental model of CFS, not CFS itself. 

No, we tried this for hrtimer kernels for a while, but programming
hrtimers the whole time (every actual task-switch) turns out to be far
too expensive. So we're back to HZ ticks and 'polling' the preemption
state.

Even if we remove all the hrtimer infrastructure overhead (can do with a
few hacks) setting the hardware requires going out to the LAPIC, which
is stupid slow.

Some hardware actually has fast/reliable/usable timers, sadly none of it
is popular.
Avi Kivity Sept. 24, 2012, 3:58 p.m. UTC | #10
On 09/24/2012 05:52 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:43 +0200, Avi Kivity wrote:
>> Wouldn't this correspond to the scheduler interrupt firing and causing a
>> reschedule?  I thought the timer was programmed for exactly the point in
>> time that CFS considers the right time for a switch.  But I'm basing
>> this on my mental model of CFS, not CFS itself. 
> 
> No, we tried this for hrtimer kernels for a while, but programming
> hrtimers the whole time (every actual task-switch) turns out to be far
> too expensive. So we're back to HZ ticks and 'polling' the preemption
> state.

Ok, so I wasn't completely off base.

With HZ=1000, an explicit schedule() here can be at most a millisecond
faster than the interrupt-driven schedule() at the next tick, and we need
to be a lot faster than that.

> Even if we remove all the hrtimer infrastructure overhead (can do with a
> few hacks) setting the hardware requires going out to the LAPIC, which
> is stupid slow.
> 
> Some hardware actually has fast/reliable/usable timers, sadly none of it
> is popular.

There is the TSC deadline timer mode of newer Intels.  Programming the
timer is a simple wrmsr, and it will fire immediately if it already
expired.  Unfortunately on AMDs it is not available, and on virtual
hardware it will be slow (~1-2 usec).
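
For reference, a minimal sketch of arming the TSC deadline timer from
kernel code, assuming the MSR_IA32_TSC_DEADLINE definition and the tsc_khz
variable exported by kernels of this era (CPU feature checks and error
handling omitted):

#include <asm/msr.h>        /* rdtscll(), wrmsrl(), MSR_IA32_TSC_DEADLINE */
#include <asm/div64.h>      /* do_div() */
#include <asm/tsc.h>        /* tsc_khz */

/* Arm the local APIC TSC-deadline timer delta_ns nanoseconds from now. */
static void arm_tsc_deadline(u64 delta_ns)
{
	u64 tsc, cycles = delta_ns;

	rdtscll(tsc);                 /* current TSC value */
	cycles *= tsc_khz;            /* ns * kHz ... */
	do_div(cycles, 1000000);      /* ... / 10^6 gives cycles */
	wrmsrl(MSR_IA32_TSC_DEADLINE, tsc + cycles);
	/* A deadline at or below the current TSC fires immediately. */
}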
Peter Zijlstra Sept. 24, 2012, 4:05 p.m. UTC | #11
On Mon, 2012-09-24 at 17:58 +0200, Avi Kivity wrote:
> There is the TSC deadline timer mode of newer Intels.  Programming the
> timer is a simple wrmsr, and it will fire immediately if it already
> expired.  Unfortunately on AMDs it is not available, and on virtual
> hardware it will be slow (~1-2 usec). 

It's also still a LAPIC write -- disguised as an MSR though :/

Also, who gives a hoot about virtual crap ;-)
Avi Kivity Sept. 24, 2012, 4:10 p.m. UTC | #12
On 09/24/2012 06:05 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 17:58 +0200, Avi Kivity wrote:
>> There is the TSC deadline timer mode of newer Intels.  Programming the
>> timer is a simple wrmsr, and it will fire immediately if it already
>> expired.  Unfortunately on AMDs it is not available, and on virtual
>> hardware it will be slow (~1-2 usec). 
> 
> Its also still a LAPIC write -- disguised as an MSR though :/

It's probably a whole lot faster though.

> Also, who gives a hoot about virtual crap ;-)

I only mentioned it to see if your virtual crap detector is still
working.  Looks like it's still in top condition, low latency and 100%
hit rate.
Peter Zijlstra Sept. 24, 2012, 4:13 p.m. UTC | #13
On Mon, 2012-09-24 at 18:10 +0200, Avi Kivity wrote:
> > Its also still a LAPIC write -- disguised as an MSR though :/
> 
> It's probably a whole lot faster though. 

I've been told its not, I haven't tried it.
Avi Kivity Sept. 24, 2012, 4:21 p.m. UTC | #14
On 09/24/2012 06:13 PM, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:10 +0200, Avi Kivity wrote:
>> > Its also still a LAPIC write -- disguised as an MSR though :/
>> 
>> It's probably a whole lot faster though. 
> 
> I've been told its not, I haven't tried it.

I'll see if I can find a machine with it (I don't see it on my Westmere;
it's probably on one of the Bridges).  Or maybe the other Peter knows.
Avi Kivity Sept. 25, 2012, 10:11 a.m. UTC | #15
On 09/24/2012 06:21 PM, Avi Kivity wrote:
> On 09/24/2012 06:13 PM, Peter Zijlstra wrote:
> > On Mon, 2012-09-24 at 18:10 +0200, Avi Kivity wrote:
> >> > Its also still a LAPIC write -- disguised as an MSR though :/
> >> 
> >> It's probably a whole lot faster though. 
> > 
> > I've been told its not, I haven't tried it.
>
> I'll see if I can find a machine with it (don't see it on my Westmere,
> it's probably on one of the Bridges.  Or maybe the other Peter knows.
>
>

Before measuring TSC_DEADLINE, I measured writing to TMICT; it costs 32
cycles.  This is on a Sandy Bridge.
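
A measurement like that could be done along these lines (a sketch, not the
actual test used; serializing instructions and averaging over many
iterations are omitted):

#include <asm/msr.h>      /* rdtscll() */
#include <asm/apic.h>     /* apic_write(), APIC_TMICT */

/* Time one write to the LAPIC initial-count register, in TSC cycles. */
static u64 time_tmict_write(void)
{
	u64 t0, t1;

	rdtscll(t0);
	apic_write(APIC_TMICT, 0xffffffffU);   /* arm the LAPIC timer */
	rdtscll(t1);
	return t1 - t0;
}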

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8323685..713b677 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1660,6 +1660,10 @@  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 			}
 		}
 	}
+	/* In overcommitted cases, yield instead of spinning */
+	if (!yielded && rq_nr_running() > 1)
+		schedule();
+
 	kvm_vcpu_set_in_spin_loop(me, false);
 
 	/* Ensure vcpu is not eligible during next spinloop */