[retry,3] Add support for Pause Filtering to AMD SVM

Message ID 200905191356.37071.mark.langsdorf@amd.com (mailing list archive)
State New, archived

Commit Message

Mark Langsdorf May 19, 2009, 6:56 p.m. UTC
From 67f831e825b64be5dedae9936ff8a60b884959f2 Mon Sep 17 00:00:00 2001
From: mark.langsdorf@amd.com 
Date: Tue, 19 May 2009 07:46:11 -0500
Subject: [PATCH]

This feature creates a new field in the VMCB called Pause
Filter Count.  If Pause Filter Count is greater than 0 and
intercepting PAUSEs is enabled, the processor will increment
an internal counter when a PAUSE instruction occurs instead
of intercepting.  When the internal counter reaches the
Pause Filter Count value, a PAUSE intercept will occur.
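
In pseudo-C, the behaviour described above is roughly the following
(an illustrative model only, not the architectural definition;
do_pause_vmexit() is a stand-in for the real #VMEXIT):

#include <stdint.h>
#include <stdbool.h>

/* Minimal stand-ins for the relevant VMCB state. */
struct pause_filter_state {
        bool     intercept_pause;       /* INTERCEPT_PAUSE bit set in the VMCB */
        uint16_t pause_filter_count;    /* threshold programmed by the host    */
        uint16_t internal_counter;      /* hidden counter kept by the CPU      */
};

static void do_pause_vmexit(void) { /* stand-in for #VMEXIT(PAUSE) */ }

/* What the CPU conceptually does each time the guest executes PAUSE. */
static void guest_executes_pause(struct pause_filter_state *s)
{
        if (!s->intercept_pause)
                return;                         /* PAUSE runs as usual */

        if (s->pause_filter_count == 0 ||
            ++s->internal_counter >= s->pause_filter_count) {
                s->internal_counter = 0;
                do_pause_vmexit();              /* intercept fires */
        }
}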

This feature can be used to detect contended spinlocks,
especially when the lock holding VCPU is not scheduled.
Rescheduling another VCPU prevents the VCPU seeking the
lock from wasting its quantum by spinning idly.  Perform
the reschedule by increasing the credited time on
the VCPU.

Experimental results show that most spinlocks are held
for less than 1000 PAUSE cycles or more than a few
thousand.  Default the Pause Filter Count to 5000 to
detect these contended spinlocks.

Processor support for this feature is indicated by a CPUID
bit.
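
For illustration, a minimal userspace check for that bit (CPUID
Fn8000_000A, EDX bit 10, the same bit the patch tests in-kernel via
svm_has(SVM_FEATURE_PAUSE_FILTER)) could look like:

#include <cpuid.h>
#include <stdbool.h>

/* Pause Filter support is advertised in CPUID Fn8000_000A, EDX bit 10. */
static bool cpu_has_pause_filter(void)
{
        unsigned int eax, ebx, ecx, edx;

        if (!__get_cpuid(0x8000000a, &eax, &ebx, &ecx, &edx))
                return false;
        return edx & (1u << 10);
}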

On a 24 core system running 4 guests each with 16 VCPUs,
this patch improved overall performance of each guest's
32 job kernbench by approximately 1%.  Further performance
improvement may be possible with a more sophisticated
yield algorithm.

-Mark Langsdorf
Operating System Research Center
AMD

Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
---
 arch/x86/include/asm/svm.h |    3 ++-
 arch/x86/kvm/svm.c         |   13 +++++++++++++
 include/linux/sched.h      |    7 +++++++
 kernel/sched.c             |    5 +++++
 4 files changed, 27 insertions(+), 1 deletions(-)

Comments

Ingo Molnar May 20, 2009, 7:40 a.m. UTC | #1
* Mark Langsdorf <mark.langsdorf@amd.com> wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b4c38bc..683bc65 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2283,6 +2283,9 @@ static inline unsigned int task_cpu(const struct task_struct *p)
>  	return task_thread_info(p)->cpu;
>  }
>  
> +extern void set_task_delay(struct task_struct *p, unsigned int delay);

> +++ b/kernel/sched.c
> @@ -1947,6 +1947,11 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +void set_task_delay(struct task_struct *p, unsigned int delay)
> +{
> +	p->se.vruntime += delay;
> +}
> +EXPORT_SYMBOL(set_task_delay);

vruntime is nice-level scaled, so this is broken. Please run this
through the scheduler folks; we can get a facility into an isolated
branch for Avi to pull, but it needs some thought.

	Ingo
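
For context on the nice-level scaling: CFS scales the charged runtime
by the task's load weight before folding it into vruntime, roughly as
in the simplified sketch below (not the exact kernel helper; NICE_0_LOAD
is the weight of a nice-0 task, 1024 in this era).

typedef unsigned long long u64;         /* standalone stand-in for the kernel type */

#define NICE_0_LOAD     1024UL

/* Simplified weight scaling: nice-0 tasks are charged 1:1, heavier
 * (negative nice) tasks less per real nanosecond, lighter tasks more. */
static u64 weighted_vruntime_delta(u64 delta_exec_ns, unsigned long weight)
{
        return delta_exec_ns * NICE_0_LOAD / weight;
}

Adding raw nanoseconds, as set_task_delay() does, therefore corresponds
to a different amount of real CPU time depending on the task's nice level.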
Peter Zijlstra May 20, 2009, 7:59 a.m. UTC | #2
On Tue, 2009-05-19 at 13:56 -0500, Mark Langsdorf wrote:
> @@ -1947,6 +1947,11 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
>         return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +void set_task_delay(struct task_struct *p, unsigned int delay)
> +{
> +       p->se.vruntime += delay;
> +}
> +EXPORT_SYMBOL(set_task_delay);

That's broken, you cannot assume that a task is SCHED_OTHER like that.

Furthermore, you cannot simply change vruntime of any odd task, this
only works for current. Also, you really need to call schedule() after
doing this for it to have any immediate effect.

Also, if you mean delay to be ns, you need to scale it. Furthermore, I
would really really want to export this as GPL only (ok, preferably not
at all).

That said, I still thoroughly dislike this whole approach.


/*
 * Dumb broken yield like interface -- use at your own peril and know
 * RT people will hate you.
 *
 * Like yield, except for SCHED_OTHER/BATCH, where it will give up @ns
 * time for the 'good' cause.
 */
void sched_delay_yield(unsigned long ns)
{
	struct task_struct *curr = current;

	if (curr->sched_class == &fair_sched_class) {
		struct sched_entity *se = &curr->se;
		__update_curr(cfs_rq_of(se), se, ns);
		schedule();
		/* XXX: task accounting ? */
	} else
		sched_yield();
}
EXPORT_SYMBOL_GPL(sched_delay_yield);
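
If a helper along these lines were merged, the KVM side would presumably
call it from the pause intercept instead of set_task_delay(); a
hypothetical sketch, reusing the 1 ms value from the posted patch:

/* Hypothetical replacement for the patch's pause_interception(),
 * assuming sched_delay_yield() from the sketch above existed: */
static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
{
        sched_delay_yield(1000000);     /* give up roughly 1 ms of runtime */
        return 1;
}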
Avi Kivity May 20, 2009, 8:38 a.m. UTC | #3
Peter Zijlstra wrote:
> That said, I still thoroughly dislike this whole approach.
>   

Can you explain why?  We have a thread that has detected that it's 
spinning.  Keeping on spinning is a waste of cpu time.  Why not let 
something else use the cpu?
Peter Zijlstra May 20, 2009, 8:42 a.m. UTC | #4
On Wed, 2009-05-20 at 11:38 +0300, Avi Kivity wrote:
> Peter Zijlstra wrote:
> > That said, I still thoroughly dislike this whole approach.
> >   
> 
> Can you explain why?  We have a thread that has detected that it's 
> spinning.  Keeping on spinning is a waste of cpu time.  Why not let 
> something else use the cpu?

Because it's a polling interface. I much prefer it if we were to get a
wakeup notification when the vcpu holding the lock releases it.
Avi Kivity May 20, 2009, 8:49 a.m. UTC | #5
Peter Zijlstra wrote:
>>> That said, I still thoroughly dislike this whole approach.
>>>   
>>>       
>> Can you explain why?  We have a thread that has detected that it's 
>> spinning.  Keeping on spinning is a waste of cpu time.  Why not let 
>> something else use the cpu?
>>     
>
> Because it's a polling interface. I much prefer it if we were to get a
> wakeup notification when the vcpu holding the lock releases it.
>   

It's a fully virtualized guest.  There's no way to get this without 
patching the guest kernel.
Peter Zijlstra May 20, 2009, 8:54 a.m. UTC | #6
On Wed, 2009-05-20 at 11:49 +0300, Avi Kivity wrote:
> Peter Zijlstra wrote:
> >>> That said, I still thoroughly dislike this whole approach.
> >>>   
> >>>       
> >> Can you explain why?  We have a thread that has detected that it's 
> >> spinning.  Keeping on spinning is a waste of cpu time.  Why not let 
> >> something else use the cpu?
> >>     
> >
> > Because it's a polling interface. I much prefer it if we were to get a
> > wakeup notification when the vcpu holding the lock releases it.
> >   
> 
> It's a fully virtualized guest.  There's no way to get this without 
> patching the guest kernel.

Yes there is.. virtualized monitor-wait stuff coupled with a
monitor-wait based spinlock implementation.

Once we go change silicon, you might as well do it right.
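
The guest half of such a scheme might look roughly like the sketch below
(illustrative only; it assumes MONITOR/MWAIT can be used on lock words
and virtualized so the vcpu actually blocks until the lock line is
written):

#include <asm/processor.h>      /* __monitor(), __mwait() */

/* Guest-kernel-side wait on a spinlock word via monitor/mwait; the
 * hypervisor could then put the vcpu to sleep until the line changes. */
static inline void monitor_wait_on_lock(volatile unsigned int *lock)
{
        while (*lock) {
                __monitor((const void *)lock, 0, 0);    /* arm monitor on the line */
                if (*lock)                              /* re-check before waiting */
                        __mwait(0, 0);                  /* wait for a write to the line */
        }
}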
Avi Kivity May 20, 2009, 9:04 a.m. UTC | #7
Peter Zijlstra wrote:
>> It's a fully virtualized guest.  There's no way to get this without 
>> patching the guest kernel.
>>     
>
> Yes there is.. virtualized monitor-wait stuff coupled with a
> monitor-wait based spinlock implementation.
>   

That only works if the guest uses monitor/mwait.  Not all of the guests 
are under our control.  I don't know whether Windows uses 
monitor/mwait.  Further, we don't have timed exits on mwait like we do 
with pause.

I've also heard that monitor/mwait are very slow and only usable on idle 
loop stuff.

> Once we go change silicon, you might as well do it right.
>   

None of the major x86 vendors are under my control.
Peter Zijlstra May 20, 2009, 9:10 a.m. UTC | #8
On Wed, 2009-05-20 at 12:04 +0300, Avi Kivity wrote:
> Peter Zijlstra wrote:

> >> It's a fully virtualized guest.  There's no way to get this without 
> >> patching the guest kernel.
> >>     
> >
> > Yes there is.. virtualized monitor-wait stuff coupled with a
> > monitor-wait based spinlock implementation.
> >   
> 
> That only works if the guest uses monitor/mwait.  Not all of the guests 
> are under our control.  I don't know whether Windows uses 
> monitor/mwait.  Further, we don't have timed exits on mwait like we do 
> with pause.

Ugh, you really care about crap like windows?

> I've also heard that monitor/mwait are very slow and only usable on idle 
> loop stuff.

Yeah, current implementations suck, doesn't mean it has to stay that
way.

> > Once we go change silicon, you might as well do it right.
> >   
> 
> None of the major x86 vendors are under my control.

I thought this patch came from AMD, who changed their silicon to 'solve'
one of these virt problems.

/me goes hide again, and pretend all of virt doesn't exist :-) Think
happy thoughts.
Avi Kivity May 20, 2009, 9:17 a.m. UTC | #9
Peter Zijlstra wrote:
>>>> It's a fully virtualized guest.  There's no way to get this without 
>>>> patching the guest kernel.
>>>>     
>>>>         
>>> Yes there is.. virtualized monitor-wait stuff coupled with a
>>> monitor-wait based spinlock implementation.
>>>   
>>>       
>> That only works if the guest uses monitor/mwait.  Not all of the guests 
>> are under our control.  I don't know whether Windows uses 
>> monitor/mwait.  Further, we don't have timed exits on mwait like we do 
>> with pause.
>>     
>
> Ugh, you really care about crap like windows?
>   

Yes, it is used by my users.  Either we convince them not to use 
Windows, or we find a way to support it well.

>> I've also heard that monitor/mwait are very slow and only usable on idle 
>> loop stuff.
>>     
>
> Yeah, current implementations suck, doesn't mean it has to stay that
> way.
>   

Well, I'm not speculating on future cpu changes.  I'd like to support
current and near-future software and hardware, not how-it-should-have-been-done
software running on how-it-should-have-been-done hardware.

>>> Once we go change silicon, you might as well do it right.
>>>   
>>>       
>> None of the major x86 vendors are under my control.
>>     
>
> I thought this patch came from AMD, who changed their silicon to 'solve'
> one of these virt problems.
>   

They changed the silicon to support existing guests.  For both Linux and 
Windows, the pause instruction is the only indication the guest is spinning.
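
A contended guest spinlock slow path is essentially just PAUSE in a
loop, which is exactly the pattern the filter counts; roughly:

/* Roughly what a guest spins on while waiting for a contended lock;
 * "rep; nop" is the PAUSE instruction, the only visible hint to the host. */
static inline void guest_spin(volatile unsigned int *lock)
{
        while (*lock)
                asm volatile("rep; nop" ::: "memory");
}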

> /me goes hide again, and pretend all of virt doesn't exist :-) Think
> happy thoughts.
>   

You'll end up running permanently in a guest, with no way out.
Avi Kivity May 20, 2009, noon UTC | #10
Mark Langsdorf wrote:
> On a 24 core system running 4 guests each with 16 VCPUs,
> this patch improved overall performance of each guest's
> 32 job kernbench by approximately 1%.  Further performance
> improvement may be possible with a more sophisticated
> yield algorithm.
>
>   

This result is approximately what you got on your previous patch.  Did
you measure with the new patch?  Approximately 1% seems too low.
Mark Langsdorf May 20, 2009, 1:52 p.m. UTC | #11
> > > Once we go change silicon, you might as well do it right.
> > >   
> > 
> > None of the major x86 vendors are under my control.
> 
> I thought this patch came from AMD, who changed their silicon 
> to 'solve' one of these virt problems.

Right, and the change we came up with was to provide
a method for filtering PAUSEs and then intercepting
after a certain number.

It may not be the solution you wanted, but I can't
change the silicon design that has the feature at 
this point.

-Mark Langsdorf
Operating System Research Center
AMD

Patch

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 85574b7..1fecb7e 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -57,7 +57,8 @@  struct __attribute__ ((__packed__)) vmcb_control_area {
 	u16 intercept_dr_write;
 	u32 intercept_exceptions;
 	u64 intercept;
-	u8 reserved_1[44];
+	u8 reserved_1[42];
+	u16 pause_filter_count;
 	u64 iopm_base_pa;
 	u64 msrpm_base_pa;
 	u64 tsc_offset;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ef43a18..86df191 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -45,6 +45,7 @@  MODULE_LICENSE("GPL");
 #define SVM_FEATURE_NPT  (1 << 0)
 #define SVM_FEATURE_LBRV (1 << 1)
 #define SVM_FEATURE_SVML (1 << 2)
+#define SVM_FEATURE_PAUSE_FILTER (1 << 10)
 
 #define DEBUGCTL_RESERVED_BITS (~(0x3fULL))
 
@@ -575,6 +576,11 @@  static void init_vmcb(struct vcpu_svm *svm)
 
 	svm->nested_vmcb = 0;
 	svm->vcpu.arch.hflags = HF_GIF_MASK;
+
+	if (svm_has(SVM_FEATURE_PAUSE_FILTER)) {
+		control->pause_filter_count = 3000;
+		control->intercept |= (1ULL << INTERCEPT_PAUSE);
+	}
 }
 
 static int svm_vcpu_reset(struct kvm_vcpu *vcpu)
@@ -2087,6 +2093,12 @@  static int interrupt_window_interception(struct vcpu_svm *svm,
 	return 1;
 }
 
+static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+{
+	set_task_delay(current, 1000000);
+	return 1;
+}
+
 static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
 				      struct kvm_run *kvm_run) = {
 	[SVM_EXIT_READ_CR0]           		= emulate_on_interception,
@@ -2123,6 +2135,7 @@  static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
 	[SVM_EXIT_CPUID]			= cpuid_interception,
 	[SVM_EXIT_IRET]                         = iret_interception,
 	[SVM_EXIT_INVD]                         = emulate_on_interception,
+	[SVM_EXIT_PAUSE]			= pause_interception,
 	[SVM_EXIT_HLT]				= halt_interception,
 	[SVM_EXIT_INVLPG]			= invlpg_interception,
 	[SVM_EXIT_INVLPGA]			= invalid_op_interception,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..683bc65 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2283,6 +2283,9 @@  static inline unsigned int task_cpu(const struct task_struct *p)
 	return task_thread_info(p)->cpu;
 }
 
+extern void set_task_delay(struct task_struct *p, unsigned int delay);
+
+
 extern void set_task_cpu(struct task_struct *p, unsigned int cpu);
 
 #else
@@ -2292,6 +2295,10 @@  static inline unsigned int task_cpu(const struct task_struct *p)
 	return 0;
 }
 
+void set_task_delay(struct task_struct *p, unsigned int delay)
+{
+}
+
 static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 {
 }
diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..3174620 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1947,6 +1947,11 @@  task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }
 
+void set_task_delay(struct task_struct *p, unsigned int delay)
+{
+	p->se.vruntime += delay;
+}
+EXPORT_SYMBOL(set_task_delay);
 
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {