
[RESEND] KVM:VMX: Add support for Pause-Loop Exiting

Message ID 4ABC18C6.9070202@intel.com (mailing list archive)
State New, archived

Commit Message

Zhai, Edwin Sept. 25, 2009, 1:11 a.m. UTC
Avi,

hrtimer is used for the sleep in the attached patch, which has a similar perf 
gain to the previous one. Maybe we can check in this patch first, and turn 
to directed yield in the future, as you suggested.

Thanks,
edwin

Avi Kivity wrote:
> On 09/23/2009 05:04 PM, Zhai, Edwin wrote:
>   
>> Avi,
>>
>> This is the patch to enable PLE, which depends on a small change to 
>> the Linux scheduler
>> (see http://lkml.org/lkml/2009/5/20/447).
>>
>> According to our discussion last time, one missing part is that on a PLE
>> exit we would pick an unscheduled vcpu at random and schedule it. But
>> further investigation found that:
>> 1. It is hard for KVM to know the scheduling state of each vcpu.
>> 2. The Linux scheduler has no existing API that can be used to pull a
>> specific task to this cpu, so we would need more changes to the common
>> scheduler.
>> So I prefer the current simple way: just give up the current cpu time.
>>
>> If there is no objection, I'll try to push the common scheduler change
>> to Linux first.
>>     
>
> We haven't sorted out what is the correct thing to do here.  I think we 
> should go for a directed yield, but until we have it, you can use 
> hrtimers to sleep for 100 microseconds and hope the holding vcpu will 
> get scheduled.  Even if it doesn't, we're only wasting a few percent cpu 
> time instead of spinning.
>
>

Comments

Avi Kivity Sept. 27, 2009, 8:28 a.m. UTC | #1
On 09/25/2009 04:11 AM, Zhai, Edwin wrote:
> Avi,
>
> hrtimer is used for the sleep in the attached patch, which has a similar perf 
> gain to the previous one. Maybe we can check in this patch first, and 
> turn to directed yield in the future, as you suggested.
>
> +/*
> + * These 2 parameters are used to configure the controls for Pause-Loop Exiting:
> + * ple_gap:    upper bound on the amount of time between two successive
> + *             executions of PAUSE in a loop. Also indicates whether PLE is
> + *             enabled. In testing, this time is usually smaller than 41 cycles.
> + * ple_window: upper bound on the amount of time a guest is allowed to execute
> + *             in a PAUSE loop. Tests indicate that most spinlocks are held for
> + *             less than 2^12 cycles.
> + * Time is measured on a counter that runs at the same rate as the TSC;
> + * refer to SDM volume 3b, sections 21.6.13 & 22.1.3.
> + */
> +#define KVM_VMX_DEFAULT_PLE_GAP    41
> +#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
> +static int __read_mostly ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
> +module_param(ple_gap, int, S_IRUGO);
> +
> +static int __read_mostly ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
> +module_param(ple_window, int, S_IRUGO);
>    

Shouldn't be __read_mostly since they're read very rarely (__read_mostly 
should be for variables that are very often read, and rarely written).

I'm not even sure they should be parameters.

>   /*
> + * Indicates that the vcpu is busy-waiting on a spinlock. Plain PAUSE
> + * exiting is not enabled, so we only get here via Pause-Loop Exiting.
> + */
> +static int handle_pause(struct kvm_vcpu *vcpu,
> +				struct kvm_run *kvm_run)
> +{
> +	ktime_t expires;
> +	skip_emulated_instruction(vcpu);
> +
> +	/* Sleep for 1 msec, and hope the lock holder gets scheduled */
> +	expires = ktime_add_ns(ktime_get(), 1000000UL);
>    

I think this should be much lower, 50-100us.  Maybe this should be a 
parameter.  With 1ms we're losing significant cpu time if the congestion 
clears.

> +	set_current_state(TASK_INTERRUPTIBLE);
> +	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
> +
>    

Please add a tracepoint for this (since it can cause significant change 
in behaviour), and move the logic to kvm_main.c.  It will be reused by 
the AMD implementation, possibly my software spinlock detector, 
paravirtualized spinlocks, and hopefully other architectures.

> +	return 1;
> +}
> +
> +/*
>
Zhai, Edwin Sept. 28, 2009, 9:33 a.m. UTC | #2
Avi Kivity wrote:
> +#define KVM_VMX_DEFAULT_PLE_GAP    41
> +#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
> +static int __read_mostly ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
> +module_param(ple_gap, int, S_IRUGO);
> +
> +static int __read_mostly ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
> +module_param(ple_window, int, S_IRUGO);
>    
>   
>
> Shouldn't be __read_mostly since they're read very rarely (__read_mostly 
> should be for variables that are very often read, and rarely written).
>   

In general, they are read-only, except that an experienced user may try 
different parameters for perf tuning.

> I'm not even sure they should be parameters.
>   

For different spinlock implementations in different OSes, and for different 
workloads, we need different parameters for tuning. It's similar to enable_ept.

>   
>>   /*
>> + * Indicates that the vcpu is busy-waiting on a spinlock. Plain PAUSE
>> + * exiting is not enabled, so we only get here via Pause-Loop Exiting.
>> + */
>> +static int handle_pause(struct kvm_vcpu *vcpu,
>> +				struct kvm_run *kvm_run)
>> +{
>> +	ktime_t expires;
>> +	skip_emulated_instruction(vcpu);
>> +
>> +	/* Sleep for 1 msec, and hope the lock holder gets scheduled */
>> +	expires = ktime_add_ns(ktime_get(), 1000000UL);
>>    
>>     
>
> I think this should be much lower, 50-100us.  Maybe this should be a 
> parameter.  With 1ms we're losing significant cpu time if the congestion 
> clears.
>   

I have made it a parameter with a default value of 100 us.

>   
>> +	set_current_state(TASK_INTERRUPTIBLE);
>> +	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
>> +
>>    
>>     
>
> Please add a tracepoint for this (since it can cause significant change 
> in behaviour), 

Isn't trace_kvm_exit(exit_reason, ...) enough? We can tell the PLE 
vmexit from other vmexits.

> and move the logic to kvm_main.c.  It will be reused by 
> the AMD implementation, possibly my software spinlock detector, 
> paravirtualized spinlocks, and hopefully other architectures.
>   

Done.
>   
>> +	return 1;
>> +}
>> +
>> +/*
>>    
>>     
>
>
Zhai, Edwin Sept. 29, 2009, 12:05 p.m. UTC | #3
Avi,
Any comments on this new patch?
Thanks,


Zhai, Edwin wrote:
> Avi Kivity wrote:
>   
>> +#define KVM_VMX_DEFAULT_PLE_GAP    41
>> +#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
>> +static int __read_mostly ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
>> +module_param(ple_gap, int, S_IRUGO);
>> +
>> +static int __read_mostly ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
>> +module_param(ple_window, int, S_IRUGO);
>>    
>>   
>>
>> Shouldn't be __read_mostly since they're read very rarely (__read_mostly 
>> should be for variables that are very often read, and rarely written).
>>   
>>     
>
> In general, they are read-only, except that an experienced user may try 
> different parameters for perf tuning.
>
>   
>> I'm not even sure they should be parameters.
>>   
>>     
>
> For different spinlock implementations in different OSes, and for different 
> workloads, we need different parameters for tuning. It's similar to enable_ept.
>
>   
>>   
>>     
>>>   /*
>>> + * Indicates that the vcpu is busy-waiting on a spinlock. Plain PAUSE
>>> + * exiting is not enabled, so we only get here via Pause-Loop Exiting.
>>> + */
>>> +static int handle_pause(struct kvm_vcpu *vcpu,
>>> +				struct kvm_run *kvm_run)
>>> +{
>>> +	ktime_t expires;
>>> +	skip_emulated_instruction(vcpu);
>>> +
>>> +	/* Sleep for 1 msec, and hope the lock holder gets scheduled */
>>> +	expires = ktime_add_ns(ktime_get(), 1000000UL);
>>>    
>>>     
>>>       
>> I think this should be much lower, 50-100us.  Maybe this should be a 
>> parameter.  With 1ms we're losing significant cpu time if the congestion 
>> clears.
>>   
>>     
>
> I have made it a parameter with a default value of 100 us.
>
>   
>>   
>>     
>>> +	set_current_state(TASK_INTERRUPTIBLE);
>>> +	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
>>> +
>>>    
>>>     
>>>       
>> Please add a tracepoint for this (since it can cause significant change 
>> in behaviour), 
>>     
>
> Isn't trace_kvm_exit(exit_reason, ...) enough? We can tell the PLE 
> vmexit from other vmexits.
>
>   
>> and move the logic to kvm_main.c.  It will be reused by 
>> the AMD implementation, possibly my software spinlock detector, 
>> paravirtualized spinlocks, and hopefully other architectures.
>>   
>>     
>
> Done.
>   
>>   
>>     
>>> +	return 1;
>>> +}
>>> +
>>> +/*
>>>    
>>>     
>>>       
>>   
>>     
Avi Kivity Sept. 29, 2009, 1:34 p.m. UTC | #4
On 09/28/2009 11:33 AM, Zhai, Edwin wrote:
>
> Avi Kivity wrote:
>> +#define KVM_VMX_DEFAULT_PLE_GAP    41
>> +#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
>> +static int __read_mostly ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
>> +module_param(ple_gap, int, S_IRUGO);
>> +
>> +static int __read_mostly ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
>> +module_param(ple_window, int, S_IRUGO);
>>
>> Shouldn't be __read_mostly since they're read very rarely 
>> (__read_mostly should be for variables that are very often read, and 
>> rarely written).
>
> In general, they are read-only, except that an experienced user may try 
> different parameters for perf tuning.


__read_mostly doesn't just mean it's read mostly.  It also means it's 
read often.  Otherwise it's just wasting space in hot cachelines.
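Avi's point is about cache geometry: __read_mostly groups marked variables into a dedicated section so genuinely hot read-side data never shares a cacheline with frequently written data, and a variable read once at module load just wastes that space. The effect can be sketched in plain C with explicit alignment (all names here are hypothetical, and 64 bytes is an assumed cacheline size):

```c
#include <stddef.h>

#define CACHELINE 64  /* assumed x86 cacheline size */

/* Hot, rarely-written data placed on its own cacheline... */
struct read_mostly_params {
	int gap;
	int window;
} __attribute__((aligned(CACHELINE)));

/* ...so that writes to frequently-updated counters never invalidate
 * the line that readers keep hot in their caches. */
struct hot_counters {
	unsigned long exits;
	char pad[CACHELINE - sizeof(unsigned long)];
};

struct mixed {
	struct read_mostly_params params;   /* occupies cacheline 0 */
	struct hot_counters counters;       /* occupies cacheline 1 */
};
```

Packing a rarely-read variable into the read-mostly line gives it this premium placement for no benefit, which is the waste being pointed out.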

>
>> I'm not even sure they should be parameters.
>
> For different spinlock implementations in different OSes, and for different 
> workloads, we need different parameters for tuning. It's similar to enable_ept.

No, global parameters don't work for tuning workloads and guests since 
they cannot be modified on a per-guest basis.  enable_ept is only useful 
for debugging and testing.

>
>>> +    set_current_state(TASK_INTERRUPTIBLE);
>>> +    schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
>>> +
>>
>> Please add a tracepoint for this (since it can cause significant 
>> change in behaviour), 
>
> Isn't trace_kvm_exit(exit_reason, ...) enough? We can tell the PLE 
> vmexit from other vmexits.

Right.  I thought of the software spinlock detector, but that's another 
problem.

I think you can drop the sleep_time parameter; it can be part of the 
function.  Also, kvm_vcpu_sleep() is confusing, since we also sleep on halt.  
Please call it kvm_vcpu_on_spin() or something similar (since that's what the 
guest is doing).
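Taking the review comments together (sleep length folded into the function, logic moved out of vmx.c for reuse, renamed to kvm_vcpu_on_spin), the generic helper might look roughly like this — a sketch of one possible shape, not the final merged code:

```c
/* virt/kvm/kvm_main.c -- hypothetical sketch */
void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
{
	ktime_t expires;

	/* Sleep for 100 us, and hope the lock holder gets scheduled */
	expires = ktime_add_ns(ktime_get(), 100000UL);
	set_current_state(TASK_INTERRUPTIBLE);
	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
}

/* arch/x86/kvm/vmx.c -- handle_pause() then reduces to: */
static int handle_pause(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
{
	skip_emulated_instruction(vcpu);
	kvm_vcpu_on_spin(vcpu);
	return 1;
}
```

Keeping the helper in kvm_main.c lets the AMD Pause-Filter implementation, paravirtualized spinlocks, and other architectures call the same function.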

Patch

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 272514c..2b49454 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -56,6 +56,7 @@ 
 #define SECONDARY_EXEC_ENABLE_VPID              0x00000020
 #define SECONDARY_EXEC_WBINVD_EXITING		0x00000040
 #define SECONDARY_EXEC_UNRESTRICTED_GUEST	0x00000080
+#define SECONDARY_EXEC_PAUSE_LOOP_EXITING	0x00000400
 
 
 #define PIN_BASED_EXT_INTR_MASK                 0x00000001
@@ -144,6 +145,8 @@  enum vmcs_field {
 	VM_ENTRY_INSTRUCTION_LEN        = 0x0000401a,
 	TPR_THRESHOLD                   = 0x0000401c,
 	SECONDARY_VM_EXEC_CONTROL       = 0x0000401e,
+	PLE_GAP                         = 0x00004020,
+	PLE_WINDOW                      = 0x00004022,
 	VM_INSTRUCTION_ERROR            = 0x00004400,
 	VM_EXIT_REASON                  = 0x00004402,
 	VM_EXIT_INTR_INFO               = 0x00004404,
@@ -248,6 +251,7 @@  enum vmcs_field {
 #define EXIT_REASON_MSR_READ            31
 #define EXIT_REASON_MSR_WRITE           32
 #define EXIT_REASON_MWAIT_INSTRUCTION   36
+#define EXIT_REASON_PAUSE_INSTRUCTION   40
 #define EXIT_REASON_MCE_DURING_VMENTRY	 41
 #define EXIT_REASON_TPR_BELOW_THRESHOLD 43
 #define EXIT_REASON_APIC_ACCESS         44
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3fe0d42..21dbfe9 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -61,6 +61,25 @@  module_param_named(unrestricted_guest,
 static int __read_mostly emulate_invalid_guest_state = 0;
 module_param(emulate_invalid_guest_state, bool, S_IRUGO);
 
+/*
+ * These 2 parameters are used to configure the controls for Pause-Loop Exiting:
+ * ple_gap:    upper bound on the amount of time between two successive
+ *             executions of PAUSE in a loop. Also indicates whether PLE is
+ *             enabled. In testing, this time is usually smaller than 41 cycles.
+ * ple_window: upper bound on the amount of time a guest is allowed to execute
+ *             in a PAUSE loop. Tests indicate that most spinlocks are held for
+ *             less than 2^12 cycles.
+ * Time is measured on a counter that runs at the same rate as the TSC;
+ * refer to SDM volume 3b, sections 21.6.13 & 22.1.3.
+ */
+#define KVM_VMX_DEFAULT_PLE_GAP    41
+#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
+static int __read_mostly ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
+module_param(ple_gap, int, S_IRUGO);
+
+static int __read_mostly ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
+module_param(ple_window, int, S_IRUGO);
+
 struct vmcs {
 	u32 revision_id;
 	u32 abort;
@@ -319,6 +338,12 @@  static inline int cpu_has_vmx_unrestricted_guest(void)
 		SECONDARY_EXEC_UNRESTRICTED_GUEST;
 }
 
+static inline int cpu_has_vmx_ple(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_PAUSE_LOOP_EXITING;
+}
+
 static inline int vm_need_virtualize_apic_accesses(struct kvm *kvm)
 {
 	return flexpriority_enabled &&
@@ -1256,7 +1281,8 @@  static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 			SECONDARY_EXEC_WBINVD_EXITING |
 			SECONDARY_EXEC_ENABLE_VPID |
 			SECONDARY_EXEC_ENABLE_EPT |
-			SECONDARY_EXEC_UNRESTRICTED_GUEST;
+			SECONDARY_EXEC_UNRESTRICTED_GUEST |
+			SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 		if (adjust_vmx_controls(min2, opt2,
 					MSR_IA32_VMX_PROCBASED_CTLS2,
 					&_cpu_based_2nd_exec_control) < 0)
@@ -1400,6 +1426,9 @@  static __init int hardware_setup(void)
 	if (enable_ept && !cpu_has_vmx_ept_2m_page())
 		kvm_disable_largepages();
 
+	if (!cpu_has_vmx_ple())
+		ple_gap = 0;
+
 	return alloc_kvm_area();
 }
 
@@ -2312,9 +2341,16 @@  static int vmx_vcpu_setup(struct vcpu_vmx *vmx)
 			exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
 		if (!enable_unrestricted_guest)
 			exec_control &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
+		if (!ple_gap)
+			exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
 		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
 	}
 
+	if (ple_gap) {
+		vmcs_write32(PLE_GAP, ple_gap);
+		vmcs_write32(PLE_WINDOW, ple_window);
+	}
+
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, !!bypass_guest_pf);
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, !!bypass_guest_pf);
 	vmcs_write32(CR3_TARGET_COUNT, 0);           /* 22.2.1 */
@@ -3362,6 +3398,24 @@  out:
 }
 
 /*
+ * Indicates that the vcpu is busy-waiting on a spinlock. Plain PAUSE
+ * exiting is not enabled, so we only get here via Pause-Loop Exiting.
+ */
+static int handle_pause(struct kvm_vcpu *vcpu,
+				struct kvm_run *kvm_run)
+{
+	ktime_t expires;
+	skip_emulated_instruction(vcpu);
+
+	/* Sleep for 1 msec, and hope the lock holder gets scheduled */
+	expires = ktime_add_ns(ktime_get(), 1000000UL);
+	set_current_state(TASK_INTERRUPTIBLE);
+	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+
+	return 1;
+}
+
+/*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
  * to be done to userspace and return 0.
@@ -3397,6 +3451,7 @@  static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_MCE_DURING_VMENTRY]      = handle_machine_check,
 	[EXIT_REASON_EPT_VIOLATION]	      = handle_ept_violation,
 	[EXIT_REASON_EPT_MISCONFIG]           = handle_ept_misconfig,
+	[EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
 };
 
 static const int kvm_vmx_max_exit_handlers =