Message ID | 4ABC18C6.9070202@intel.com (mailing list archive)
---|---
State | New, archived
On 09/25/2009 04:11 AM, Zhai, Edwin wrote:

> Avi,
>
> hrtimer is used for sleep in the attached patch, which has a similar
> perf gain to the previous one. Maybe we can check in this patch first,
> and turn to direct yield in the future, as you suggested.
>
> +/*
> + * These 2 parameters are used to config the controls for Pause-Loop Exiting:
> + * ple_gap:    upper bound on the amount of time between two successive
> + *             executions of PAUSE in a loop. Also indicates if ple enabled.
> + *             According to test, this time is usually smaller than 41 cycles.
> + * ple_window: upper bound on the amount of time a guest is allowed to execute
> + *             in a PAUSE loop. Tests indicate that most spinlocks are held for
> + *             less than 2^12 cycles.
> + * Time is measured based on a counter that runs at the same rate as the TSC,
> + * refer to SDM volume 3b, sections 21.6.13 & 22.1.3.
> + */
> +#define KVM_VMX_DEFAULT_PLE_GAP    41
> +#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
> +static int __read_mostly ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
> +module_param(ple_gap, int, S_IRUGO);
> +
> +static int __read_mostly ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
> +module_param(ple_window, int, S_IRUGO);

These shouldn't be __read_mostly, since they're read very rarely (__read_mostly should be for variables that are very often read, and rarely written). I'm not even sure they should be parameters.

> /*
> + * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
> + * exiting, so we only get here on a cpu with PAUSE-Loop-Exiting.
> + */
> +static int handle_pause(struct kvm_vcpu *vcpu,
> +                        struct kvm_run *kvm_run)
> +{
> +    ktime_t expires;
> +    skip_emulated_instruction(vcpu);
> +
> +    /* Sleep for 1 msec, and hope lock-holder got scheduled */
> +    expires = ktime_add_ns(ktime_get(), 1000000UL);

I think this should be much lower, 50-100us. Maybe this should be a parameter. With 1ms we lose significant cpu time if the congestion clears.

> +    set_current_state(TASK_INTERRUPTIBLE);
> +    schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);

Please add a tracepoint for this (since it can cause a significant change in behaviour), and move the logic to kvm_main.c. It will be reused by the AMD implementation, possibly my software spinlock detector, paravirtualized spinlocks, and hopefully other architectures.

> +    return 1;
> +}
> +
> +/*
Avi Kivity wrote:
> +#define KVM_VMX_DEFAULT_PLE_GAP    41
> +#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
> +static int __read_mostly ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
> +module_param(ple_gap, int, S_IRUGO);
> +
> +static int __read_mostly ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
> +module_param(ple_window, int, S_IRUGO);
>
> Shouldn't be __read_mostly since they're read very rarely (__read_mostly
> should be for variables that are very often read, and rarely written).

In general, they are read-only, except that an experienced user may try a different parameter for perf tuning.

> I'm not even sure they should be parameters.

For different spinlocks in different OSes, and for different workloads, we need different parameters for tuning. It's similar to enable_ept.

>> /*
>> + * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
>> + * exiting, so we only get here on a cpu with PAUSE-Loop-Exiting.
>> + */
>> +static int handle_pause(struct kvm_vcpu *vcpu,
>> +                        struct kvm_run *kvm_run)
>> +{
>> +    ktime_t expires;
>> +    skip_emulated_instruction(vcpu);
>> +
>> +    /* Sleep for 1 msec, and hope lock-holder got scheduled */
>> +    expires = ktime_add_ns(ktime_get(), 1000000UL);
>
> I think this should be much lower, 50-100us. Maybe this should be a
> parameter. With 1ms we lose significant cpu time if the congestion
> clears.

I have made it a parameter with a default value of 100 us.

>> +    set_current_state(TASK_INTERRUPTIBLE);
>> +    schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
>
> Please add a tracepoint for this (since it can cause a significant change
> in behaviour),

Isn't trace_kvm_exit(exit_reason, ...) enough? We can tell the PLE vmexit from other vmexits.

> and move the logic to kvm_main.c. It will be reused by
> the AMD implementation, possibly my software spinlock detector,
> paravirtualized spinlocks, and hopefully other architectures.

Done.
Avi,
Any comments for this new patch?

Thanks,

Zhai, Edwin wrote:
> [snip]
On 09/28/2009 11:33 AM, Zhai, Edwin wrote:

> Avi Kivity wrote:
>> +#define KVM_VMX_DEFAULT_PLE_GAP    41
>> +#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
>> +static int __read_mostly ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
>> +module_param(ple_gap, int, S_IRUGO);
>> +
>> +static int __read_mostly ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
>> +module_param(ple_window, int, S_IRUGO);
>>
>> Shouldn't be __read_mostly since they're read very rarely
>> (__read_mostly should be for variables that are very often read, and
>> rarely written).
>
> In general, they are read-only, except that an experienced user may try
> a different parameter for perf tuning.

__read_mostly doesn't just mean it's read mostly; it also means it's read often. Otherwise it's just wasting space in hot cachelines.

>> I'm not even sure they should be parameters.
>
> For different spinlocks in different OSes, and for different workloads,
> we need different parameters for tuning. It's similar to enable_ept.

No, global parameters don't work for tuning workloads and guests, since they cannot be modified on a per-guest basis. enable_ept is only useful for debugging and testing.

>>> +    set_current_state(TASK_INTERRUPTIBLE);
>>> +    schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
>>
>> Please add a tracepoint for this (since it can cause a significant
>> change in behaviour),
>
> Isn't trace_kvm_exit(exit_reason, ...) enough? We can tell the PLE
> vmexit from other vmexits.

Right. I thought of the software spinlock detector, but that's another problem.

I think you can drop the sleep_time parameter; it can be part of the function. Also, kvm_vcpu_sleep() is confusing, since we also sleep on halt. Please call it kvm_vcpu_on_spin() or something similar (since that's what the guest is doing).
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 272514c..2b49454 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -56,6 +56,7 @@
 #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
 #define SECONDARY_EXEC_WBINVD_EXITING           0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST       0x00000080
+#define SECONDARY_EXEC_PAUSE_LOOP_EXITING       0x00000400
 
 #define PIN_BASED_EXT_INTR_MASK                 0x00000001
@@ -144,6 +145,8 @@ enum vmcs_field {
 	VM_ENTRY_INSTRUCTION_LEN        = 0x0000401a,
 	TPR_THRESHOLD                   = 0x0000401c,
 	SECONDARY_VM_EXEC_CONTROL       = 0x0000401e,
+	PLE_GAP                         = 0x00004020,
+	PLE_WINDOW                      = 0x00004022,
 	VM_INSTRUCTION_ERROR            = 0x00004400,
 	VM_EXIT_REASON                  = 0x00004402,
 	VM_EXIT_INTR_INFO               = 0x00004404,
@@ -248,6 +251,7 @@ enum vmcs_field {
 #define EXIT_REASON_MSR_READ            31
 #define EXIT_REASON_MSR_WRITE           32
 #define EXIT_REASON_MWAIT_INSTRUCTION   36
+#define EXIT_REASON_PAUSE_INSTRUCTION   40
 #define EXIT_REASON_MCE_DURING_VMENTRY  41
 #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
 #define EXIT_REASON_APIC_ACCESS         44
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3fe0d42..21dbfe9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -61,6 +61,25 @@ module_param_named(unrestricted_guest,
 static int __read_mostly emulate_invalid_guest_state = 0;
 module_param(emulate_invalid_guest_state, bool, S_IRUGO);
 
+/*
+ * These 2 parameters are used to config the controls for Pause-Loop Exiting:
+ * ple_gap:    upper bound on the amount of time between two successive
+ *             executions of PAUSE in a loop. Also indicates if ple enabled.
+ *             According to test, this time is usually smaller than 41 cycles.
+ * ple_window: upper bound on the amount of time a guest is allowed to execute
+ *             in a PAUSE loop. Tests indicate that most spinlocks are held for
+ *             less than 2^12 cycles.
+ * Time is measured based on a counter that runs at the same rate as the TSC,
+ * refer to SDM volume 3b, sections 21.6.13 & 22.1.3.
+ */
+#define KVM_VMX_DEFAULT_PLE_GAP    41
+#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
+static int __read_mostly ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
+module_param(ple_gap, int, S_IRUGO);
+
+static int __read_mostly ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
+module_param(ple_window, int, S_IRUGO);
+
 struct vmcs {
 	u32 revision_id;
 	u32 abort;
@@ -319,6 +338,12 @@ static inline int cpu_has_vmx_unrestricted_guest(void)
 		SECONDARY_EXEC_UNRESTRICTED_GUEST;
 }
 
+static inline int cpu_has_vmx_ple(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+}
+
 static inline int vm_need_virtualize_apic_accesses(struct kvm *kvm)
 {
 	return flexpriority_enabled &&
@@ -1256,7 +1281,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 			SECONDARY_EXEC_WBINVD_EXITING |
 			SECONDARY_EXEC_ENABLE_VPID |
 			SECONDARY_EXEC_ENABLE_EPT |
-			SECONDARY_EXEC_UNRESTRICTED_GUEST;
+			SECONDARY_EXEC_UNRESTRICTED_GUEST |
+			SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 		if (adjust_vmx_controls(min2, opt2,
 					MSR_IA32_VMX_PROCBASED_CTLS2,
 					&_cpu_based_2nd_exec_control) < 0)
@@ -1400,6 +1426,9 @@ static __init int hardware_setup(void)
 	if (enable_ept && !cpu_has_vmx_ept_2m_page())
 		kvm_disable_largepages();
 
+	if (!cpu_has_vmx_ple())
+		ple_gap = 0;
+
 	return alloc_kvm_area();
 }
@@ -2312,9 +2341,16 @@ static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 			exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
 		if (!enable_unrestricted_guest)
 			exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
+		if (!ple_gap)
+			exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
 	}
 
+	if (ple_gap) {
+		vmcs_write32(PLE_GAP, ple_gap);
+		vmcs_write32(PLE_WINDOW, ple_window);
+	}
+
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, !!bypass_guest_pf);
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, !!bypass_guest_pf);
 	vmcs_write32(CR3_TARGET_COUNT, 0);           /* 22.2.1 */
@@ -3362,6 +3398,24 @@ out:
 }
 
 /*
+ * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
+ * exiting, so we only get here on a cpu with PAUSE-Loop-Exiting.
+ */
+static int handle_pause(struct kvm_vcpu *vcpu,
+			struct kvm_run *kvm_run)
+{
+	ktime_t expires;
+	skip_emulated_instruction(vcpu);
+
+	/* Sleep for 1 msec, and hope lock-holder got scheduled */
+	expires = ktime_add_ns(ktime_get(), 1000000UL);
+	set_current_state(TASK_INTERRUPTIBLE);
+	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+
+	return 1;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume. Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
@@ -3397,6 +3451,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_MCE_DURING_VMENTRY]  = handle_machine_check,
 	[EXIT_REASON_EPT_VIOLATION]       = handle_ept_violation,
 	[EXIT_REASON_EPT_MISCONFIG]       = handle_ept_misconfig,
+	[EXIT_REASON_PAUSE_INSTRUCTION]   = handle_pause,
 };
 
 static const int kvm_vmx_max_exit_handlers =