| Message ID | 200905191356.37071.mark.langsdorf@amd.com (mailing list archive) |
|---|---|
| State | New, archived |
* Mark Langsdorf <mark.langsdorf@amd.com> wrote:

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b4c38bc..683bc65 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2283,6 +2283,9 @@ static inline unsigned int task_cpu(const struct task_struct *p)
>  	return task_thread_info(p)->cpu;
>  }
>
> +extern void set_task_delay(struct task_struct *p, unsigned int delay);

> +++ b/kernel/sched.c
> @@ -1947,6 +1947,11 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>
> +void set_task_delay(struct task_struct *p, unsigned int delay)
> +{
> +	p->se.vruntime += delay;
> +}
> +EXPORT_SYMBOL(set_task_delay);

vruntime is nice level scaled, so this is broken.

Please run this through the scheduler folks, we can get a facility into
an isolated branch for Avi to pull, but it needs some thought.

	Ingo
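To make Ingo's objection concrete: CFS puts runtime through a nice-level
weight scaling step before it reaches vruntime, so a raw nanosecond delay
means a different wall-clock penalty at each nice level. A rough sketch of
the conversion that would be needed, assuming it lived in kernel/sched_fair.c
where the CFS helpers are visible (set_task_delay_scaled is a made-up name,
not part of the posted patch):

/*
 * Hypothetical sketch only.  vruntime advances at a rate scaled by the
 * task's load weight, so convert the nanosecond delay with
 * calc_delta_fair(), the same helper __update_curr() uses when charging
 * runtime, before adding it to vruntime.
 */
static void set_task_delay_scaled(struct task_struct *p, unsigned long delay_ns)
{
	struct sched_entity *se = &p->se;

	se->vruntime += calc_delta_fair(delay_ns, se);
}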
On Tue, 2009-05-19 at 13:56 -0500, Mark Langsdorf wrote:

> @@ -1947,6 +1947,11 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>
> +void set_task_delay(struct task_struct *p, unsigned int delay)
> +{
> +	p->se.vruntime += delay;
> +}
> +EXPORT_SYMBOL(set_task_delay);

That's broken, you cannot assume that a task is SCHED_OTHER like that.

Furthermore, you cannot simply change the vruntime of any odd task; this
only works for current. Also, you really need to call schedule() after
doing this for it to have any immediate effect. Also, if you mean delay
to be in ns, you need to scale it.

Furthermore, I would really really want to export this as GPL only
(ok, preferably not at all).

That said, I still thoroughly dislike this whole approach.

/*
 * Dumb broken yield-like interface -- use at your own peril and know
 * RT people will hate you.
 *
 * Like yield, except for SCHED_OTHER/BATCH, where it will give up @ns
 * time for the 'good' cause.
 */
void sched_delay_yield(unsigned long ns)
{
	struct task_struct *curr = current;

	if (curr->sched_class == &fair_sched_class) {
		struct sched_entity *se = &curr->se;

		__update_curr(cfs_rq_of(se), se, ns);
		schedule();

		/* XXX: task accounting ? */
	} else
		sched_yield();
}
EXPORT_SYMBOL_GPL(sched_delay_yield);
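If an interface of that shape existed, the consumer in the SVM patch further
down the thread would presumably collapse to a one-line call; a hypothetical
sketch, mirroring the pause_interception() handler and the 1000000 ns value
from the posted patch:

/*
 * Hypothetical caller only: on a PAUSE-filter exit, give up roughly 1ms
 * of CPU instead of letting the guest vcpu keep spinning.
 */
static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
{
	sched_delay_yield(1000000);
	return 1;
}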
Peter Zijlstra wrote:
> That said, I still thoroughly dislike this whole approach.

Can you explain why? We have a thread that has detected that it's
spinning. Keeping on spinning is a waste of cpu time. Why not let
something else use the cpu?
On Wed, 2009-05-20 at 11:38 +0300, Avi Kivity wrote:
> Peter Zijlstra wrote:
> > That said, I still thoroughly dislike this whole approach.
>
> Can you explain why? We have a thread that has detected that it's
> spinning. Keeping on spinning is a waste of cpu time. Why not let
> something else use the cpu?

Because it's a polling interface. I much prefer it if we were to get a
wakeup notification when the vcpu holding the lock releases it.
Peter Zijlstra wrote:
>>> That said, I still thoroughly dislike this whole approach.
>>
>> Can you explain why? We have a thread that has detected that it's
>> spinning. Keeping on spinning is a waste of cpu time. Why not let
>> something else use the cpu?
>
> Because it's a polling interface. I much prefer it if we were to get a
> wakeup notification when the vcpu holding the lock releases it.

It's a fully virtualized guest. There's no way to get this without
patching the guest kernel.
On Wed, 2009-05-20 at 11:49 +0300, Avi Kivity wrote:
> Peter Zijlstra wrote:
> >>> That said, I still thoroughly dislike this whole approach.
> >>
> >> Can you explain why? We have a thread that has detected that it's
> >> spinning. Keeping on spinning is a waste of cpu time. Why not let
> >> something else use the cpu?
> >
> > Because it's a polling interface. I much prefer it if we were to get a
> > wakeup notification when the vcpu holding the lock releases it.
>
> It's a fully virtualized guest. There's no way to get this without
> patching the guest kernel.

Yes there is.. virtualized monitor-wait stuff coupled with a
monitor-wait based spinlock implementation.

Once we go change silicon, you might as well do it right.
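The monitor/mwait pairing referred to here is the existing x86 mechanism
where a waiter arms a monitor on a cache line and stalls until that line is
written. A guest spinlock slow path built on it would look roughly like the
sketch below (illustrative only, not taken from any patch in this thread):

/*
 * Rough sketch of a monitor/mwait wait loop on a lock word.  The unlock
 * path's store to the monitored cache line wakes the waiter, which is
 * the event a hypervisor virtualizing monitor/mwait could use to block
 * the vcpu instead of letting it spin.
 */
static inline void mwait_spin_wait(volatile unsigned long *lock_word)
{
	while (*lock_word) {
		/* Arm the monitor on the lock word's cache line. */
		asm volatile("monitor" : : "a" (lock_word), "c" (0), "d" (0));
		/* Re-check after arming to close the wakeup race, then wait. */
		if (*lock_word)
			asm volatile("mwait" : : "a" (0), "c" (0));
	}
}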
Peter Zijlstra wrote:
>> It's a fully virtualized guest. There's no way to get this without
>> patching the guest kernel.
>
> Yes there is.. virtualized monitor-wait stuff coupled with a
> monitor-wait based spinlock implementation.

That only works if the guest uses monitor/mwait. Not all of the guests
are under our control. I don't know whether Windows uses monitor/mwait.
Further, we don't have timed exits on mwait like we do with pause.

I've also heard that monitor/mwait are very slow and only usable on idle
loop stuff.

> Once we go change silicon, you might as well do it right.

None of the major x86 vendors are under my control.
On Wed, 2009-05-20 at 12:04 +0300, Avi Kivity wrote:
> Peter Zijlstra wrote:
> >> It's a fully virtualized guest. There's no way to get this without
> >> patching the guest kernel.
> >
> > Yes there is.. virtualized monitor-wait stuff coupled with a
> > monitor-wait based spinlock implementation.
>
> That only works if the guest uses monitor/mwait. Not all of the guests
> are under our control. I don't know whether Windows uses
> monitor/mwait. Further, we don't have timed exits on mwait like we do
> with pause.

Ugh, you really care about crap like windows?

> I've also heard that monitor/mwait are very slow and only usable on idle
> loop stuff.

Yeah, current implementations suck, but that doesn't mean it has to stay
that way.

> > Once we go change silicon, you might as well do it right.
>
> None of the major x86 vendors are under my control.

I thought this patch came from AMD, who changed their silicon to 'solve'
one of these virt problems.

/me goes hide again, and pretends all of virt doesn't exist :-) Think
happy thoughts.
Peter Zijlstra wrote:
>>>> It's a fully virtualized guest. There's no way to get this without
>>>> patching the guest kernel.
>>>
>>> Yes there is.. virtualized monitor-wait stuff coupled with a
>>> monitor-wait based spinlock implementation.
>>
>> That only works if the guest uses monitor/mwait. Not all of the guests
>> are under our control. I don't know whether Windows uses
>> monitor/mwait. Further, we don't have timed exits on mwait like we do
>> with pause.
>
> Ugh, you really care about crap like windows?

Yes, it is used by my users. Either we convince them not to use Windows,
or we find a way to support it well.

>> I've also heard that monitor/mwait are very slow and only usable on idle
>> loop stuff.
>
> Yeah, current implementations suck, but that doesn't mean it has to stay
> that way.

Well, I'm not speculating on future cpu changes. I'd like to support
current and near-future software and hardware, not
how-it-should-have-been-done software running on
how-it-should-have-been-done hardware.

>>> Once we go change silicon, you might as well do it right.
>>
>> None of the major x86 vendors are under my control.
>
> I thought this patch came from AMD, who changed their silicon to 'solve'
> one of these virt problems.

They changed the silicon to support existing guests. For both Linux and
Windows, the PAUSE instruction is the only indication that the guest is
spinning.

> /me goes hide again, and pretends all of virt doesn't exist :-) Think
> happy thoughts.

You'll end up running permanently in a guest, with no way out.
Mark Langsdorf wrote:
> On a 24 core system running 4 guests each with 16 VCPUs,
> this patch improved overall performance of each guest's
> 32 job kernbench by approximately 1%. Further performance
> improvement may be possible with a more sophisticated
> yield algorithm.

This result is approximately what you got with your previous patch. Did
you measure with the new patch? Approximately 1% seems too low.
> > > Once we go change silicon, you might as well do it right.
> >
> > None of the major x86 vendors are under my control.
>
> I thought this patch came from AMD, who changed their silicon
> to 'solve' one of these virt problems.

Right, and the change we came up with was to provide a method for
filtering PAUSEs and then intercepting after a certain number. It may
not be the solution you wanted, but I can't change the silicon design
that has the feature at this point.

-Mark Langsdorf
Operating System Research Center
AMD
diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 85574b7..1fecb7e 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -57,7 +57,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 	u16 intercept_dr_write;
 	u32 intercept_exceptions;
 	u64 intercept;
-	u8 reserved_1[44];
+	u8 reserved_1[42];
+	u16 pause_filter_count;
 	u64 iopm_base_pa;
 	u64 msrpm_base_pa;
 	u64 tsc_offset;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ef43a18..86df191 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -45,6 +45,7 @@ MODULE_LICENSE("GPL");
 #define SVM_FEATURE_NPT  (1 << 0)
 #define SVM_FEATURE_LBRV (1 << 1)
 #define SVM_FEATURE_SVML (1 << 2)
+#define SVM_FEATURE_PAUSE_FILTER (1 << 10)

 #define DEBUGCTL_RESERVED_BITS (~(0x3fULL))

@@ -575,6 +576,11 @@ static void init_vmcb(struct vcpu_svm *svm)

 	svm->nested_vmcb = 0;
 	svm->vcpu.arch.hflags = HF_GIF_MASK;
+
+	if (svm_has(SVM_FEATURE_PAUSE_FILTER)) {
+		control->pause_filter_count = 3000;
+		control->intercept |= (1ULL << INTERCEPT_PAUSE);
+	}
 }

 static int svm_vcpu_reset(struct kvm_vcpu *vcpu)
@@ -2087,6 +2093,12 @@ static int interrupt_window_interception(struct vcpu_svm *svm,
 	return 1;
 }

+static int pause_interception(struct vcpu_svm *svm, struct kvm_run *kvm_run)
+{
+	set_task_delay(current, 1000000);
+	return 1;
+}
+
 static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
 				  struct kvm_run *kvm_run) = {
 	[SVM_EXIT_READ_CR0] = emulate_on_interception,
@@ -2123,6 +2135,7 @@ static int (*svm_exit_handlers[])(struct vcpu_svm *svm,
 	[SVM_EXIT_CPUID] = cpuid_interception,
 	[SVM_EXIT_IRET] = iret_interception,
 	[SVM_EXIT_INVD] = emulate_on_interception,
+	[SVM_EXIT_PAUSE] = pause_interception,
 	[SVM_EXIT_HLT] = halt_interception,
 	[SVM_EXIT_INVLPG] = invlpg_interception,
 	[SVM_EXIT_INVLPGA] = invalid_op_interception,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..683bc65 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2283,6 +2283,9 @@ static inline unsigned int task_cpu(const struct task_struct *p)
 	return task_thread_info(p)->cpu;
 }

+extern void set_task_delay(struct task_struct *p, unsigned int delay);
+
+
 extern void set_task_cpu(struct task_struct *p, unsigned int cpu);

 #else
@@ -2292,6 +2295,10 @@ static inline unsigned int task_cpu(const struct task_struct *p)
 	return 0;
 }

+void set_task_delay(struct task_struct *p, unsigned int delay)
+{
+}
+
 static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
 {
 }
diff --git a/kernel/sched.c b/kernel/sched.c
index b902e58..3174620 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1947,6 +1947,11 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
 	return delta < (s64)sysctl_sched_migration_cost;
 }

+void set_task_delay(struct task_struct *p, unsigned int delay)
+{
+	p->se.vruntime += delay;
+}
+EXPORT_SYMBOL(set_task_delay);

 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
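As a footnote on the hardware feature the patch keys off: the
SVM_FEATURE_PAUSE_FILTER bit tested by svm_has() is bit 10 of EDX in CPUID
leaf 0x8000000A, the SVM feature identification leaf. A small userspace
sketch to check whether a given machine advertises it:

/* Sketch: report whether the CPU advertises the SVM PAUSE filter. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	/* CPUID leaf 0x8000000A is the SVM feature identification leaf. */
	if (!__get_cpuid(0x8000000a, &eax, &ebx, &ecx, &edx)) {
		puts("CPUID leaf 0x8000000A not available");
		return 1;
	}

	/* Bit 10 of EDX corresponds to SVM_FEATURE_PAUSE_FILTER above. */
	printf("PAUSE filter: %s\n", (edx & (1u << 10)) ? "supported" : "not supported");
	return 0;
}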