
[retry,1] Add support for Pause Filtering to AMD SVM

Message ID 200905071000.14038.mark.langsdorf@amd.com (mailing list archive)
State New, archived

Commit Message

Mark Langsdorf May 7, 2009, 3 p.m. UTC
commit 01813db8627e74018c8cec90df7e345839351f23
Author: root <root@xendinar01.amd.com>
Date:   Thu May 7 09:44:10 2009 -0500

    New AMD processors will support the Pause Filter Feature.
    This feature creates a new field in the VMCB called Pause
    Filter Count.  If Pause Filter Count is greater than 0 and
    intercepting PAUSEs is enabled, the processor will increment
    an internal counter when a PAUSE instruction occurs instead
    of intercepting.  When the internal counter reaches the
    Pause Filter Count value, a PAUSE intercept will occur.
    
    This feature can be used to detect contended spinlocks,
    especially when the lock holding VCPU is not scheduled.
    Rescheduling another VCPU prevents the VCPU seeking the
    lock from wasting its quantum by spinning idly.
    
    Experimental results show that most spinlocks are held
    for less than 1000 PAUSE cycles or more than a few
    thousand.  Default the Pause Filter Counter to 3000 to
    detect the contended spinlocks.
    
    Processor support for this feature is indicated by a CPUID
    bit.
    
    On a 24 core system running 4 guests each with 16 VCPUs,
    this patch improved overall performance of each guest's
    32 job kernbench by approximately 1%.  Further performance
    improvement may be possible with a more sophisticated
    yield algorithm.
    
    -Mark Langsdorf
    Operating System Research Center
    AMD
    
    Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
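
For reference, a stand-alone check of the CPUID bit mentioned above. The leaf and bit position are assumptions based on the SVM_FEATURE_PAUSE_FILTER (1 << 10) definition in the patch below, i.e. the SVM feature identification leaf 0x8000000A, EDX bit 10:

/*
 * Illustrative user-space check only -- not part of the patch.
 * Assumes the Pause Filter bit is CPUID leaf 0x8000000A, EDX bit 10,
 * matching SVM_FEATURE_PAUSE_FILTER (1 << 10) below.
 */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x8000000a, &eax, &ebx, &ecx, &edx)) {
		puts("SVM feature identification leaf not available");
		return 1;
	}
	printf("Pause Filter supported: %s\n",
	       (edx & (1u << 10)) ? "yes" : "no");
	return 0;
}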



Comments

Avi Kivity May 7, 2009, 3:31 p.m. UTC | #1
(copying Ingo)

Mark Langsdorf wrote:
> commit 01813db8627e74018c8cec90df7e345839351f23
> Author: root <root@xendinar01.amd.com>
> Date:   Thu May 7 09:44:10 2009 -0500
>
>     New AMD processors will support the Pause Filter Feature.
>     This feature creates a new field in the VMCB called Pause
>     Filter Count.  If Pause Filter Count is greater than 0 and
>     intercepting PAUSEs is enabled, the processor will increment
>     an internal counter when a PAUSE instruction occurs instead
>     of intercepting.  When the internal counter reaches the
>     Pause Filter Count value, a PAUSE intercept will occur.
>     
>     This feature can be used to detect contended spinlocks,
>     especially when the lock holding VCPU is not scheduled.
>     Rescheduling another VCPU prevents the VCPU seeking the
>     lock from wasting its quantum by spinning idly.
>     
>     Experimental results show that most spinlocks are held
>     for less than 1000 PAUSE cycles or more than a few
>     thousand.  Default the Pause Filter Counter to 3000 to
>     detect the contended spinlocks.
>     
>     Processor support for this feature is indicated by a CPUID
>     bit.
>     
>     On a 24 core system running 4 guests each with 16 VCPUs,
>     this patch improved overall performance of each guest's
>     32 job kernbench by approximately 1%.  Further performance
>     improvement may be possible with a more sophisticated
>     yield algorithm.
>     
>     -Mark Langsdorf
>     Operating System Research Center
>     AMD
>     
>     Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
>   

(please use git format-patch rather than git show, and set up user.name 
and user.email properly)

>  
>  	svm->nested_vmcb = 0;
>  	svm->vcpu.arch.hflags = HF_GIF_MASK;
> +
> +	if (svm_has(SVM_FEATURE_PAUSE_FILTER)) {
> +		control->pause_filter_count = 5000;
> +		control->intercept |= (1ULL << INTERCEPT_PAUSE);
> +	}
> +
>  }

3000 or 5000?

>  
> +static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
> +{
> +	/* Simple yield */
> +	vcpu_put(&svm->vcpu);
> +	schedule();
> +	vcpu_load(&svm->vcpu);
> +	return 1;
> +

Ingo, will this do anything under CFS, or will CFS note that nothing has 
changed in the accounting and reschedule us immediately?
Ingo Molnar May 11, 2009, 2:15 p.m. UTC | #2
* Avi Kivity <avi@redhat.com> wrote:

>>  +static int pause_interception(struct vcpu_svm *svm, struct kvm_run 
>> *kvm_run)
>> +{
>> +	/* Simple yield */
>> +	vcpu_put(&svm->vcpu);
>> +	schedule();
>> +	vcpu_load(&svm->vcpu);
>> +	return 1;
>> +
>
> Ingo, will this do anything under CFS, or will CFS note that 
> nothing has changed in the accounting and reschedule us 
> immediately?

The scheduler will yield to another task only if the current task 
has become ineligible. I.e schedule() is largely a NOP on 
TASK_RUNNING tasks (i.e. here).

I.e. this is a somewhat poor solution as far as scheduling goes. But 
i'm wondering what the CPU side does. Can REP-NOP really take 
thousands of cycles? If yes, under what circumstances?

	Ingo
Avi Kivity May 11, 2009, 2:24 p.m. UTC | #3
Ingo Molnar wrote:
>   
>>>  +static int pause_interception(struct vcpu_svm *svm, struct kvm_run 
>>> *kvm_run)
>>> +{
>>> +	/* Simple yield */
>>> +	vcpu_put(&svm->vcpu);
>>> +	schedule();
>>> +	vcpu_load(&svm->vcpu);
>>> +	return 1;
>>> +
>>>       
>> Ingo, will this do anything under CFS, or will CFS note that 
>> nothing has changed in the accounting and reschedule us 
>> immediately?
>>     
>
> The scheduler will yield to another task only if the current task 
> has become ineligible. I.e schedule() is largely a NOP on 
> TASK_RUNNING tasks (i.e. here).
>   

Especially on preemptible kernels, where the schedule() would have 
already happened if it could cause anything, IIUC.

> I.e. this is a somewhat poor solution as far as scheduling goes. But 
> i'm wondering what the CPU side does. Can REP-NOP really take 
> thousands of cycles? If yes, under what circumstances?
>   

The guest is running rep-nop in a loop while trying to acquire a 
spinlock.  The hardware detects this (most likely, repeated rep-nop with 
the same rip) and exits.  We can program the loop count; obviously if 
we're spinning for only a short while it's better to keep spinning while 
hoping the lock will be released soon.

The idea is to detect that the guest is not making forward progress and 
yield.  If I could tell the scheduler, you may charge me a couple of 
milliseconds, I promise not to sue, that would be ideal.  Other tasks 
can become eligible, hopefully the task holding the spinlock, and by the 
time we're scheduled back the long running task will have finished and 
released the lock.

For newer Linux as a guest we're better off paravirtualizing this, so we 
can tell the host which vcpu holds the lock; in this case kvm will want 
to say, take a couple milliseconds off my account and transfer it to 
this task (so called directed yield).  However there's no reason to 
paravirtualize all cpu_relax() calls.
Ingo Molnar May 11, 2009, 2:33 p.m. UTC | #4
* Avi Kivity <avi@redhat.com> wrote:

>> I.e. this is a somewhat poor solution as far as scheduling goes. 
>> But i'm wondering what the CPU side does. Can REP-NOP really take 
>> thousands of cycles? If yes, under what circumstances?
>
> The guest is running rep-nop in a loop while trying to acquire a 
> spinlock.  The hardware detects this (most likely, repeated 
> rep-nop with the same rip) and exits.  We can program the loop 
> count; obviously if we're spinning for only a short while it's 
> better to keep spinning while hoping the lock will be released 
> soon.
>
> The idea is to detect that the guest is not making forward 
> progress and yield.  If I could tell the scheduler, you may charge 
> me a couple of milliseconds, I promise not to sue, that would be 
> ideal. [...]

Ok, with such a waiver, who could refuse?

This really needs a new kernel-internal scheduler API though, which 
does a lot of fancy things to do:

        se->vruntime += 1000000;

i.e. add 1 msec worth of nanoseconds to the task's timeline. (first 
remove it from the rbtree, then add it back, and nice-weight it as 
well) And only do it if there's other tasks running on this CPU or 
so.

_That_ would be pretty efficient, and would do the right thing when 
two (or more) vcpus run on the same CPU, and it would also do the 
right thing if there are repeated VM-exits due to pause filtering.

Please dont even think about using yield for this though - that will 
just add a huge hit to this task and wont result in any sane 
behavior - and yield is bound to some historic user-space behavior 
as well.

A gradual and linear back-off from the current timeline is more of a 
fair negotiation process between vcpus and results in more or less 
sane (and fair) scheduling, and no unnecessary looping.

You could even do an exponential backoff up to a limit of 1-10 msecs 
or so, starting at 100 usecs.

	Ingo
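
A sketch of what such a handler could look like. sched_charge_current() and the per-vcpu pause_backoff_ns field are hypothetical names standing in for the new kernel-internal API described above; only the handler signature and the 100 usec start / 10 msec cap come from this thread.

/*
 * Hypothetical sketch, not working code: sched_charge_current() stands
 * in for the new kernel-internal API above (dequeue the entity, add a
 * nice-weighted delta to se->vruntime, re-enqueue, and only when other
 * tasks are runnable on this CPU).  pause_backoff_ns is an assumed
 * per-vcpu field; a real version would also reset it once the vcpu
 * stops pausing.
 */
static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
{
	u64 delta = svm->vcpu.arch.pause_backoff_ns;

	if (!delta)
		delta = 100 * NSEC_PER_USEC;	/* start at 100 usecs */
	else
		delta = min_t(u64, 2 * delta, 10 * NSEC_PER_MSEC);
	svm->vcpu.arch.pause_backoff_ns = delta;

	sched_charge_current(delta);	/* charge vruntime instead of yielding */
	return 1;
}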
Peter Zijlstra May 11, 2009, 2:42 p.m. UTC | #5
On Mon, 2009-05-11 at 17:24 +0300, Avi Kivity wrote:

> > I.e. this is a somewhat poor solution as far as scheduling goes. But 
> > i'm wondering what the CPU side does. Can REP-NOP really take 
> > thousands of cycles? If yes, under what circumstances?
> >   
> 
> The guest is running rep-nop in a loop while trying to acquire a 
> spinlock.  The hardware detects this (most likely, repeated rep-nop with 
> the same rip) and exits.  We can program the loop count; obviously if 
> we're spinning for only a short while it's better to keep spinning while 
> hoping the lock will be released soon.
> 
> The idea is to detect that the guest is not making forward progress and 
> yield.  If I could tell the scheduler, you may charge me a couple of 
> milliseconds, I promise not to sue, that would be ideal.  Other tasks 
> can become eligible, hopefully the task holding the spinlock, and by the 
> time we're scheduled back the long running task will have finished and 
> released the lock.
> 
> For newer Linux as a guest we're better off paravirtualizing this, so we 
> can tell the host which vcpu holds the lock; in this case kvm will want 
> to say, take a couple milliseconds off my account and transfer it to 
> this task (so called directed yield).  However there's no reason to 
> paravirtualize all cpu_relax() calls.

So we're now officially giving up on (soft) realtime virtualization?

Avi Kivity May 11, 2009, 2:51 p.m. UTC | #6
Ingo Molnar wrote:
> * Avi Kivity <avi@redhat.com> wrote:
>
>   
>>> I.e. this is a somewhat poor solution as far as scheduling goes. 
>>> But i'm wondering what the CPU side does. Can REP-NOP really take 
>>> thousands of cycles? If yes, under what circumstances?
>>>       
>> The guest is running rep-nop in a loop while trying to acquire a 
>> spinlock.  The hardware detects this (most likely, repeated 
>> rep-nop with the same rip) and exits.  We can program the loop 
>> count; obviously if we're spinning for only a short while it's 
>> better to keep spinning while hoping the lock will be released 
>> soon.
>>
>> The idea is to detect that the guest is not making forward 
>> progress and yield.  If I could tell the scheduler, you may charge 
>> me a couple of milliseconds, I promise not to sue, that would be 
>> ideal. [...]
>>     
>
> Ok, with such a waiver, who could refuse?
>
> This really needs a new kernel-internal scheduler API though, which 
> does a lot of fancy things to do:
>
>         se->vruntime += 1000000;
>
> i.e. add 1 msec worth of nanoseconds to the task's timeline. (first 
> remove it from the rbtree, then add it back, and nice-weight it as 
> well) 

I suspected it would be as simple as this.

> And only do it if there's other tasks running on this CPU or 
> so.
>   

What would happen if there weren't?  I'd guess the task would continue 
running (but with a warped vruntime)?

> _That_ would be pretty efficient, and would do the right thing when 
> two (or more) vcpus run on the same CPU, and it would also do the 
> right thing if there are repeated VM-exits due to pause filtering.
>
> Please dont even think about using yield for this though - that will 
> just add a huge hit to this task and wont result in any sane 
> behavior - and yield is bound to some historic user-space behavior 
> as well.
>
> A gradual and linear back-off from the current timeline is more of a 
> fair negotiation process between vcpus and results in more or less 
> sane (and fair) scheduling, and no unnecessary looping.
>
> You could even do an exponential backoff up to a limit of 1-10 msecs 
> or so, starting at 100 usecs.
>   

Good idea, it eliminates another variable to be tuned.
Ingo Molnar May 11, 2009, 2:59 p.m. UTC | #7
* Avi Kivity <avi@redhat.com> wrote:

>> And only do it if there's other tasks running on this CPU or so.
>
> What would happen if there weren't?  I'd guess the task would 
> continue running (but with a warped vruntime)?

We dont want that warping to occur - we just want to go back and 
burn CPU time in VM context. The problem is (as Peter pointed it 
out) that this hw facility is incomplete and does not give us any 
event (interrupt) and does not give us any event key (address we are 
waiting for) either.

So the next best thing to do is to go back to the guest, because 
that is where we make the most progress, more likely, and that is 
where we want to be to make progress immediately, with the shortest 
latency.

( Perhaps we could also increase vruntime beyond the standard 
  latency value to make sure any freshly woken task gets executed 
  first if we are still looping. )

>> _That_ would be pretty efficient, and would do the right thing when  
>> two (or more) vcpus run on the same CPU, and it would also do the  
>> right thing if there are repeated VM-exits due to pause filtering.
>>
>> Please dont even think about using yield for this though - that will  
>> just add a huge hit to this task and wont result in any sane behavior - 
>> and yield is bound to some historic user-space behavior as well.
>>
>> A gradual and linear back-off from the current timeline is more of a  
>> fair negotiation process between vcpus and results in more or less  
>> sane (and fair) scheduling, and no unnecessary looping.
>>
>> You could even do an exponential backoff up to a limit of 1-10 msecs  
>> or so, starting at 100 usecs.
>>   
>
> Good idea, it eliminates another variable to be tuned.

It could be made fully self-tuning, if the filter threshold can be 
tuned fast enough. (an MSR write? A VM context field update?)

I.e. the 3000 cycles value itself could be eliminated as well. (with 
just a common-sense max of say 100,000 cycles enforced)

	Ingo
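
A sketch of one possible self-tuning policy on the VMCB side. The halve/double heuristic and the bounds are assumptions; only control->pause_filter_count comes from the patch, and since it is a u16 the enforceable ceiling is 65535, somewhat below the 100,000-cycle figure above.

#define PAUSE_FILTER_MIN	  500
#define PAUSE_FILTER_MAX	65535	/* u16 VMCB field */

/* Assumed heuristic: if other tasks are runnable on this CPU, exit and
 * back off sooner; otherwise let the guest spin with fewer exits. */
static void tune_pause_filter(struct vcpu_svm *svm, bool others_runnable)
{
	u32 count = svm->vmcb->control.pause_filter_count;

	if (others_runnable)
		count = max_t(u32, count / 2, PAUSE_FILTER_MIN);
	else
		count = min_t(u32, count * 2, PAUSE_FILTER_MAX);

	svm->vmcb->control.pause_filter_count = count;
}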
Peter Zijlstra May 11, 2009, 3:01 p.m. UTC | #8
On Mon, 2009-05-11 at 17:51 +0300, Avi Kivity wrote:
> Ingo Molnar wrote:
> > * Avi Kivity <avi@redhat.com> wrote:
> >
> >   
> >>> I.e. this is a somewhat poor solution as far as scheduling goes. 
> >>> But i'm wondering what the CPU side does. Can REP-NOP really take 
> >>> thousands of cycles? If yes, under what circumstances?
> >>>       
> >> The guest is running rep-nop in a loop while trying to acquire a 
> >> spinlock.  The hardware detects this (most likely, repeated 
> >> rep-nop with the same rip) and exits.  We can program the loop 
> >> count; obviously if we're spinning for only a short while it's 
> >> better to keep spinning while hoping the lock will be released 
> >> soon.
> >>
> >> The idea is to detect that the guest is not making forward 
> >> progress and yield.  If I could tell the scheduler, you may charge 
> >> me a couple of milliseconds, I promise not to sue, that would be 
> >> ideal. [...]
> >>     
> >
> > Ok, with such a waiver, who could refuse?
> >
> > This really needs a new kernel-internal scheduler API though, which 
> > does a lot of fancy things to do:
> >
> >         se->vruntime += 1000000;
> >
> > i.e. add 1 msec worth of nanoseconds to the task's timeline. (first 
> > remove it from the rbtree, then add it back, and nice-weight it as 
> > well) 
> 
> I suspected it would be as simple as this.

Is that thread guaranteed to run as SCHED_OTHER?

Avi Kivity May 11, 2009, 3:05 p.m. UTC | #9
Peter Zijlstra wrote:
> On Mon, 2009-05-11 at 17:24 +0300, Avi Kivity wrote:
>
>   
>>> I.e. this is a somewhat poor solution as far as scheduling goes. But 
>>> i'm wondering what the CPU side does. Can REP-NOP really take 
>>> thousands of cycles? If yes, under what circumstances?
>>>   
>>>       
>> The guest is running rep-nop in a loop while trying to acquire a 
>> spinlock.  The hardware detects this (most likely, repeated rep-nop with 
>> the same rip) and exits.  We can program the loop count; obviously if 
>> we're spinning for only a short while it's better to keep spinning while 
>> hoping the lock will be released soon.
>>
>> The idea is to detect that the guest is not making forward progress and 
>> yield.  If I could tell the scheduler, you may charge me a couple of 
>> milliseconds, I promise not to sue, that would be ideal.  Other tasks 
>> can become eligible, hopefully the task holding the spinlock, and by the 
>> time we're scheduled back the long running task will have finished and 
>> released the lock.
>>
>> For newer Linux as a guest we're better off paravirtualizing this, so we 
>> can tell the host which vcpu holds the lock; in this case kvm will want 
>> to say, take a couple milliseconds off my account and transfer it to 
>> this task (so called directed yield).  However there's no reason to 
>> paravirtualize all cpu_relax() calls.
>>     
>
> So we're now officially giving up on (soft) realtime virtualization?
>
>   

Wouldn't realtime guests be in a realtime scheduling class?  That ought 
to ignore this time_yield() (or however it is eventually named).
Avi Kivity May 11, 2009, 3:06 p.m. UTC | #10
Peter Zijlstra wrote:
>>> This really needs a new kernel-internal scheduler API though, which 
>>> does a lot of fancy things to do:
>>>
>>>         se->vruntime += 1000000;
>>>
>>> i.e. add 1 msec worth of nanoseconds to the task's timeline. (first 
>>> remove it from the rbtree, then add it back, and nice-weight it as 
>>> well) 
>>>       
>> I suspected it would be as simple as this.
>>     
>
> Is that thread guaranteed to run as SCHED_OTHER?
>   

No, it's user specified.  But if it isn't SCHED_OTHER, we don't mind if 
it spins (and we hope realtime systems aren't overcommitted).
Avi Kivity May 11, 2009, 3:12 p.m. UTC | #11
Ingo Molnar wrote:
> * Avi Kivity <avi@redhat.com> wrote:
>
>   
>>> And only do it if there's other tasks running on this CPU or so.
>>>       
>> What would happen if there weren't?  I'd guess the task would 
>> continue running (but with a warped vruntime)?
>>     
>
> We dont want that warping to occur - we just want to go back and 
> burn CPU time in VM context. The problem is (as Peter pointed it 
> out) that this hw facility is incomplete and does not give us any 
> event (interrupt) and does not give us any event key (address we are 
> waiting for) either.
>
> So the next best thing to do is to go back to the guest, because 
> that is where we make the most progress, more likely, and that is 
> where we want to be to make progress immediately, with the shortest 
> latency.
>   

Right, but I thought vruntime += blah would go back into the guest if 
there were no other runnable tasks.

Oh I see -- we'd exit immediately and warp vruntime very fast as long as 
we're spinning.

> ( Perhaps we could also increase vruntime beyond the standard 
>   latency value to make sure any freshly woken task gets executed 
>   first if we are still looping. )
>   

We could be halfway through our remaining time.  We could set it to the 
next task + epsilon.

But that may be too aggressive.  If the lock holder is on another cpu, 
it may well complete before the standard latency value expires.  Since 
we're giving up cpu time potentially to other guests, we don't want to 
give too much.

If we wake up too soon, we'll spin for a few microseconds and yield 
again.  So I think your 100 us + exponential backoff is best.

>>> _That_ would be pretty efficient, and would do the right thing when  
>>> two (or more) vcpus run on the same CPU, and it would also do the  
>>> right thing if there are repeated VM-exits due to pause filtering.
>>>
>>> Please dont even think about using yield for this though - that will  
>>> just add a huge hit to this task and wont result in any sane behavior - 
>>> and yield is bound to some historic user-space behavior as well.
>>>
>>> A gradual and linear back-off from the current timeline is more of a  
>>> fair negotiation process between vcpus and results in more or less  
>>> sane (and fair) scheduling, and no unnecessary looping.
>>>
>>> You could even do an exponential backoff up to a limit of 1-10 msecs  
>>> or so, starting at 100 usecs.
>>>   
>>>       
>> Good idea, it eliminates another variable to be tuned.
>>     
>
> It could be made fully self-tuning, if the filter threshold can be 
> tuned fast enough. (an MSR write? A VM context field update?)
>   

The latter.

> I.e. the 3000 cycles value itself could be eliminated as well. (with 
> just a common-sense max of say 100,000 cycles enforced)
>   

Yeah, though that has a much smaller effect as it's only responsible for 
a few microseconds of spinning.
Ingo Molnar May 11, 2009, 3:18 p.m. UTC | #12
* Avi Kivity <avi@redhat.com> wrote:

>> I.e. the 3000 cycles value itself could be eliminated as well. 
>> (with just a common-sense max of say 100,000 cycles enforced)
>
> Yeah, though that has a much smaller effect as it's only 
> responsible for a few microseconds of spinning.

3000 cycles would be 1-2 usecs. Isnt the VM exit+entry cost still in 
that range?

	Ingo
Avi Kivity May 11, 2009, 3:28 p.m. UTC | #13
Ingo Molnar wrote:
> * Avi Kivity <avi@redhat.com> wrote:
>
>   
>>> I.e. the 3000 cycles value itself could be eliminated as well. 
>>> (with just a common-sense max of say 100,000 cycles enforced)
>>>       
>> Yeah, though that has a much smaller effect as it's only 
>> responsible for a few microseconds of spinning.
>>     
>
> 3000 cycles would be 1-2 usecs. Isnt the VM exit+entry cost still in 
> that range?
>   

It's 3000 executions of rep nop, so you need to account for the entire 
spinlock loop body.

The Linux spinlock is

             "1:\t"
             "cmpl %0, %2\n\t"
             "je 2f\n\t"
             "rep ; nop\n\t"
             "movzwl %1, %2\n\t"
             /* don't need lfence here, because loads are in-order */
             "jmp 1b\n"

5 instructions, maybe 2-3 cycles, not counting any special rep nop 
overhead.  Mark, any idea what the spin time is?

VM entry/exit is around 1us on the newer processors.
Mark Langsdorf May 11, 2009, 3:36 p.m. UTC | #14
> Ingo Molnar wrote:
> > * Avi Kivity <avi@redhat.com> wrote:
> >
> >   
> >>> I.e. the 3000 cycles value itself could be eliminated as well. 
> >>> (with just a common-sense max of say 100,000 cycles enforced)
> >>>       
> >> Yeah, though that has a much smaller effect as it's only 
> >> responsible for a few microseconds of spinning.
> >>     
> >
> > 3000 cycles would be 1-2 usecs. Isnt the VM exit+entry cost 
> > still in that range?

For the processors that support this feature, VM exit+entry is
a little over 2000 cycles.
> 
> It's 3000 executions of rep nop, so you need to account for 
> the entire 
> spinlock loop body.
> 
> The Linux spinlock is
> 
>              "1:\t"
>              "cmpl %0, %2\n\t"
>              "je 2f\n\t"
>              "rep ; nop\n\t"
>              "movzwl %1, %2\n\t"
>              /* don't need lfence here, because loads are in-order */
>              "jmp 1b\n"
> 
> 5 instructions, maybe 2-3 cycles, not counting any special rep nop 
> overhead.  Mark, any idea what the spin time is?

If I'm understanding the question right, the contested
spin locks are being held for 5K to 10K iterations of PAUSE.
So 10K to 30K cycles if your estimate of the spinlock
cycle time is correct.  

-Mark Langsdorf
Operating System Research Center
AMD

Avi Kivity May 11, 2009, 3:40 p.m. UTC | #15
Langsdorf, Mark wrote:
>> The Linux spinlock is
>>
>>              "1:\t"
>>              "cmpl %0, %2\n\t"
>>              "je 2f\n\t"
>>              "rep ; nop\n\t"
>>              "movzwl %1, %2\n\t"
>>              /* don't need lfence here, because loads are in-order */
>>              "jmp 1b\n"
>>
>> 5 instructions, maybe 2-3 cycles, not counting any special rep nop 
>> overhead.  Mark, any idea what the spin time is?
>>     
>
> If I'm understanding the question right, the contested
> spin locks are being held for 5K to 10K iterations of PAUSE.
> So 10K to 30K cycles if your estimate of the spinlock
> cycle time is correct.  
>   

My estimate is not very reliable.  Can you measure this?

btw, I'd expect you'd get much more significant improvement on Windows.  
Booting 16-vcpu Windows 2008 x64 took forever on my dual core host, and 
my guess is spinlock contention.  Of course you'd need a working yield, 
as Ingo said schedule() does very little.
Mark Langsdorf May 11, 2009, 3:58 p.m. UTC | #16
> >> Please dont even think about using yield for this though - 

Oops.  I'm glad I waited to get some benchmark results before
submitting that version.

> >> A gradual and linear back-off from the current timeline is 
> >> more of a fair negotiation process between vcpus and
> >> results in more or less  
> >> sane (and fair) scheduling, and no unnecessary looping.
> >>
> >> You could even do an exponential backoff up to a limit of 
> >> 1-10 msecs or so, starting at 100 usecs.
> >
> > Good idea, it eliminates another variable to be tuned.
> 
> It could be made fully self-tuning, if the filter threshold can be 
> tuned fast enough. (an MSR write? A VM context field update?)

VMCB field update.

So the suggestion is to add a function similar to set_task_cpu()
that increases the vruntime with an exponential backoff?  Is
that sufficient to cause a new VCPU to step in?  I'm obviously
not very familiar with the scheduler code.
 
> I.e. the 3000 cycles value itself could be eliminated as well. (with 
> just a common-sense max of say 100,000 cycles enforced)

I don't understand what you're saying here.  There needs to be
some value in the pause filter counter to trigger the intercept.

-Mark Langsdorf
Operating System Research Center
AMD


Patch

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 85574b7..1fecb7e 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -57,7 +57,8 @@  struct __attribute__ ((__packed__)) vmcb_control_area {
 	u16 intercept_dr_write;
 	u32 intercept_exceptions;
 	u64 intercept;
-	u8 reserved_1[44];
+	u8 reserved_1[42];
+	u16 pause_filter_count;
 	u64 iopm_base_pa;
 	u64 msrpm_base_pa;
 	u64 tsc_offset;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ef43a18..4279141 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -45,6 +45,7 @@  MODULE_LICENSE("GPL");
 #define SVM_FEATURE_NPT  (1 << 0)
 #define SVM_FEATURE_LBRV (1 << 1)
 #define SVM_FEATURE_SVML (1 << 2)
+#define SVM_FEATURE_PAUSE_FILTER (1 << 10)
 
 #define DEBUGCTL_RESERVED_BITS (~(0x3fULL))
 
@@ -575,6 +576,12 @@  static void init_vmcb(struct vcpu_svm *svm)
 
 	svm->nested_vmcb = 0;
 	svm->vcpu.arch.hflags = HF_GIF_MASK;
+
+	if (svm_has(SVM_FEATURE_PAUSE_FILTER)) {
+		control->pause_filter_count = 5000;
+		control->intercept |= (1ULL << INTERCEPT_PAUSE);
+	}
+
 }
 
 static int svm_vcpu_reset(struct kvm_vcpu *vcpu)
@@ -2087,6 +2094,15 @@  static int interrupt_window_interception(struct vcpu_svm *svm,
 	return 1;
 }
 
+static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+{
+	/* Simple yield */
+	vcpu_put(&svm->vcpu);
+	schedule();
+	vcpu_load(&svm->vcpu);
+	return 1;
+}
+
 static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
 				      struct kvm_run *kvm_run) = {
 	[SVM_EXIT_READ_CR0]           		= emulate_on_interception,
@@ -2123,6 +2139,7 @@  static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
 	[SVM_EXIT_CPUID]			= cpuid_interception,
 	[SVM_EXIT_IRET]                         = iret_interception,
 	[SVM_EXIT_INVD]                         = emulate_on_interception,
+	[SVM_EXIT_PAUSE]			= pause_interception,
 	[SVM_EXIT_HLT]				= halt_interception,
 	[SVM_EXIT_INVLPG]			= invlpg_interception,
 	[SVM_EXIT_INVLPGA]			= invalid_op_interception,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2b73e19..e2b730d 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -710,6 +710,7 @@  void vcpu_load(struct kvm_vcpu *vcpu)
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
 }
+EXPORT_SYMBOL_GPL(vcpu_load);
 
 void vcpu_put(struct kvm_vcpu *vcpu)
 {
@@ -719,6 +720,7 @@  void vcpu_put(struct kvm_vcpu *vcpu)
 	preempt_enable();
 	mutex_unlock(&vcpu->mutex);
 }
+EXPORT_SYMBOL_GPL(vcpu_put);
 
 static void ack_flush(void *_completed)
 {