[v3,4/7] KVM: VMX: dynamise PLE window

Message ID	1408637291-18533-5-git-send-email-rkrcmar@redhat.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <kvm-owner@kernel.org> From: =?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?= <rkrcmar@redhat.com> To: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>, Gleb Natapov <gleb@kernel.org>, Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com>, Vinod Chegu <chegu_vinod@hp.com>, Hui-Zhi Zhao <hui-zhi.zhao@hp.com>, Christian Borntraeger <borntraeger@de.ibm.com>, Lisa Mitchell <lisa.mitchell@hp.com> Subject: [PATCH v3 4/7] KVM: VMX: dynamise PLE window Date: Thu, 21 Aug 2014 18:08:08 +0200 Message-Id: <1408637291-18533-5-git-send-email-rkrcmar@redhat.com> In-Reply-To: <1408637291-18533-1-git-send-email-rkrcmar@redhat.com> References: <1408637291-18533-1-git-send-email-rkrcmar@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sender: kvm-owner@vger.kernel.org Precedence: bulk

Message ID

1408637291-18533-5-git-send-email-rkrcmar@redhat.com (mailing list archive)

State

New, archived

Headers

From: =?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?= <rkrcmar@redhat.com>
To: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>,
	Gleb Natapov <gleb@kernel.org>,
	Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com>,
	Vinod Chegu <chegu_vinod@hp.com>, Hui-Zhi Zhao <hui-zhi.zhao@hp.com>,
	Christian Borntraeger <borntraeger@de.ibm.com>,
	Lisa Mitchell <lisa.mitchell@hp.com>
Subject: [PATCH v3 4/7] KVM: VMX: dynamise PLE window
Date: Thu, 21 Aug 2014 18:08:08 +0200
Message-Id: <1408637291-18533-5-git-send-email-rkrcmar@redhat.com>
In-Reply-To: <1408637291-18533-1-git-send-email-rkrcmar@redhat.com>
References: <1408637291-18533-1-git-send-email-rkrcmar@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: kvm-owner@vger.kernel.org
Precedence: bulk

Commit Message

Radim Krčmář Aug. 21, 2014, 4:08 p.m. UTC

Window is increased on every PLE exit and decreased on every sched_in.
The idea is that we don't want to PLE exit if there is no preemption
going on.
We do this with sched_in() because it does not hold rq lock.

There are two new kernel parameters for changing the window:
 ple_window_grow and ple_window_shrink
ple_window_grow affects the window on PLE exit and ple_window_shrink
does it on sched_in;  depending on their value, the window is modifier
like this: (ple_window is kvm_intel's global)

  ple_window_shrink/ |
  ple_window_grow    | PLE exit           | sched_in
  -------------------+--------------------+---------------------
  < 1                |  = ple_window      |  = ple_window
  < ple_window       | *= ple_window_grow | /= ple_window_shrink
  otherwise          | += ple_window_grow | -= ple_window_shrink

A third new parameter, ple_window_max, controls the maximal ple_window;
it is internally rounded down to a closest multiple of ple_window_grow.

VCPU's PLE window is never allowed below ple_window.

Signed-off-by: Radim Kr?má? <rkrcmar@redhat.com>
---
 arch/x86/kvm/vmx.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 85 insertions(+), 2 deletions(-)

Comments

Raghavendra K T Aug. 21, 2014, 7:10 p.m. UTC | #1

On 08/21/2014 09:38 PM, Radim Kr?má? wrote:
> Window is increased on every PLE exit and decreased on every sched_in.
> The idea is that we don't want to PLE exit if there is no preemption
> going on.
> We do this with sched_in() because it does not hold rq lock.
>
> There are two new kernel parameters for changing the window:
>   ple_window_grow and ple_window_shrink
> ple_window_grow affects the window on PLE exit and ple_window_shrink
> does it on sched_in;  depending on their value, the window is modifier
> like this: (ple_window is kvm_intel's global)
>
>    ple_window_shrink/ |
>    ple_window_grow    | PLE exit           | sched_in
>    -------------------+--------------------+---------------------
>    < 1                |  = ple_window      |  = ple_window
>    < ple_window       | *= ple_window_grow | /= ple_window_shrink
>    otherwise          | += ple_window_grow | -= ple_window_shrink
>
> A third new parameter, ple_window_max, controls the maximal ple_window;
> it is internally rounded down to a closest multiple of ple_window_grow.
>
> VCPU's PLE window is never allowed below ple_window.
>
> Signed-off-by: Radim Kr?má? <rkrcmar@redhat.com>
> ---
>   arch/x86/kvm/vmx.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 85 insertions(+), 2 deletions(-)
>

Thanks for the nice patch.
default of grow = 2 and shrink = 0 is very good, which aids fast
clamping in overcommit and less ple_exits in undercommit.
with a small concern over modifier (shrinker) value in patch 6,

Reviewed-by: Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com>

> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 18e0e52..a2fa6ba 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -125,14 +125,32 @@ module_param(nested, bool, S_IRUGO);
>    * Time is measured based on a counter that runs at the same rate as the TSC,
>    * refer SDM volume 3b section 21.6.13 & 22.1.3.
>    */
> -#define KVM_VMX_DEFAULT_PLE_GAP    128
> -#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
> +#define KVM_VMX_DEFAULT_PLE_GAP           128
> +#define KVM_VMX_DEFAULT_PLE_WINDOW        4096
> +#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW   2
> +#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0
> +#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX    \
> +		INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
> +
>   static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
>   module_param(ple_gap, int, S_IRUGO);
>
>   static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
>   module_param(ple_window, int, S_IRUGO);
>
> +/* Default doubles per-vcpu window every exit. */
> +static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
> +module_param(ple_window_grow, int, S_IRUGO);
> +
> +/* Default resets per-vcpu window every exit to ple_window. */
> +static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
> +module_param(ple_window_shrink, int, S_IRUGO);
> +
> +/* Default is to compute the maximum so we can never overflow. */
> +static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
> +static int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
> +module_param(ple_window_max, int, S_IRUGO);

Some observations:

1. I think a large ple_window for e.g.,  16MB itself
would be very large value after which the ple_window change would not
have much effect.

So could KVM_VMX_DEFAULT_PLE_WINDOW_MAX be just
KVM_VMX_DEFAULT_PLE_WINDOW^2 or something ? or otherwise..

How about ..
define KVM_VMX_DEFAULT_PLE_WINDOW_MAX  = INT_MAX 
/KVM_VMX_DEFAULT_PLE_WINDOW.

(considering we don't allow default grow to be greater than default ple
window).

2. can we consider min_ple_window value of 2k. tracing showed that in
overcommit there were several occations of 4k <- 4k situations.

3. Do you think using ( << and >> ) instead of (*, /) saves some cycles
considering we could have potentially have grow,shrink numbers != power of 2

> +
>   extern const ulong vmx_return;
>
>   #define NR_AUTOLOAD_MSRS 8
> @@ -5679,12 +5697,73 @@ out:
>   	return ret;
>   }
>
> +static int __grow_ple_window(int val)
> +{
> +	if (ple_window_grow < 1)

why not ple_window_grow <= 1 ?


> +		return ple_window;
> +
> +	val = min(val, ple_window_actual_max);
> +
> +	if (ple_window_grow < ple_window)
> +		val *= ple_window_grow;
> +	else
> +		val += ple_window_grow;
> +
> +	return val;
> +}
> +
> +static int __shrink_ple_window(int val, int modifier, int minimum)
> +{
> +	if (modifier < 1)

why not modifier < 1. May be this would address a concern in patch 6.

> +		return ple_window;
> +
> +	if (modifier < ple_window)
> +		val /= modifier;
> +	else
> +		val -= modifier;
> +
> +	return max(val, minimum);
> +}
> +
> +static void grow_ple_window(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	vmx->ple_window = __grow_ple_window(vmx->ple_window);
> +}
> +
> +static void shrink_ple_window(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
> +	vmx->ple_window = __shrink_ple_window(vmx->ple_window,
> +	                                      ple_window_shrink, ple_window);
> +}
> +
> +/*
> + * ple_window_actual_max is computed to be one grow_ple_window() below
> + * ple_window_max. (See __grow_ple_window for the reason.)
> + * This prevents overflows, because ple_window_max is int.
> + * ple_window_max effectively rounded down to a multiple of ple_window_grow in
> + * this process.
> + * ple_window_max is also prevented from setting vmx->ple_window < ple_window.
> + */
> +static void update_ple_window_actual_max(void)
> +{
> +	ple_window_actual_max =
> +			__shrink_ple_window(max(ple_window_max, ple_window),
> +			                    ple_window_grow, INT_MIN);
> +}
> +
>   /*
>    * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
>    * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
>    */
>   static int handle_pause(struct kvm_vcpu *vcpu)
>   {
> +	if (ple_gap)
> +		grow_ple_window(vcpu);
> +
>   	skip_emulated_instruction(vcpu);
>   	kvm_vcpu_on_spin(vcpu);
>
> @@ -8854,6 +8933,8 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
>
>   void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
>   {
> +	if (ple_gap)
> +		shrink_ple_window(vcpu);
>   }
>
>   static struct kvm_x86_ops vmx_x86_ops = {
> @@ -9077,6 +9158,8 @@ static int __init vmx_init(void)
>   	} else
>   		kvm_disable_tdp();
>
> +	update_ple_window_actual_max();
> +
>   	return 0;
>
>   out7:
>

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Paolo Bonzini Aug. 21, 2014, 8:31 p.m. UTC | #2

Il 21/08/2014 21:10, Raghavendra K T ha scritto:
> 2. can we consider min_ple_window value of 2k. tracing showed that in
> overcommit there were several occations of 4k <- 4k situations.

Yes, that could be good now that we have adaptive exits.  Perhaps even 1k.

> 3. Do you think using ( << and >> ) instead of (*, /) saves some cycles
> considering we could have potentially have grow,shrink numbers != power
> of 2

* is very fast. / a bit less so but it never occurs with the defaults.
So I think it's okay.  I'm not sure why one would use a multiplier
bigger than 2.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Radim Krčmář Aug. 21, 2014, 8:59 p.m. UTC | #3

2014-08-22 00:40+0530, Raghavendra K T:
> On 08/21/2014 09:38 PM, Radim Kr?má? wrote:
> Thanks for the nice patch.
> default of grow = 2 and shrink = 0 is very good, which aids fast
> clamping in overcommit and less ple_exits in undercommit.
> with a small concern over modifier (shrinker) value in patch 6,
> 
> Reviewed-by: Raghavendra KT <raghavendra.kt@linux.vnet.ibm.com>

Thank you for the review and testing.

> >-#define KVM_VMX_DEFAULT_PLE_GAP    128
> >-#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
> >+#define KVM_VMX_DEFAULT_PLE_GAP           128
> >+#define KVM_VMX_DEFAULT_PLE_WINDOW        4096
> >+#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW   2
> >+#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0
> >+#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX    \
> >+		INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
> 
> Some observations:
> 
> 1. I think a large ple_window for e.g.,  16MB itself
> would be very large value after which the ple_window change would not
> have much effect.

Agreed, 16M is around 4ms@4GHz, holding a spinlock for that long is
closer to a bug.

> So could KVM_VMX_DEFAULT_PLE_WINDOW_MAX be just
> KVM_VMX_DEFAULT_PLE_WINDOW^2 or something ? or otherwise..

I'd be perfectly content with a high default maximum like this, yet
there isn't much point in having that default at all if we can't reach
it in practice, so with the current default, we are at least ready for
THz+ x86 intels :)

I though about deafaulting it to some guessed fraction of the scheduler
tick, but then, I wanted to merge at least something :)

> How about ..
> define KVM_VMX_DEFAULT_PLE_WINDOW_MAX  = INT_MAX
> /KVM_VMX_DEFAULT_PLE_WINDOW.

= 524288 (2^19), too low IMO,
no-overcommit scenarios seem to go higher quite often.

> (considering we don't allow default grow to be greater than default ple
> window).
> 
> 2. can we consider min_ple_window value of 2k. tracing showed that in
> overcommit there were several occations of 4k <- 4k situations.

Low values are more affected by KVM's overhead on every PLE exit, but
benchmarks really decide this ...

Having a separate ple_window_min would allow more tuning, so I can do
that if there are benefits of lower windows.
On the other hand, I thought that increasing the default window would be
a good default, which would make separate minimum even better.

> 3. Do you think using ( << and >> ) instead of (*, /) saves some cycles
> considering we could have potentially have grow,shrink numbers != power of 2

* takes one or two cycles more than <<, so I wouldn't replace it,
because it limits available grows a lot.
(I don't expect we would set values above 5.)

/ is slow (around ten times slower than *), which the reason why we
avoid it by default, but I still prefer more options.

> >+static int __grow_ple_window(int val)
> >+{
> >+	if (ple_window_grow < 1)
> 
> why not ple_window_grow <= 1 ?

Emergent mini-feature: have different static levels of ple_window, in
combination with dynamic knobs.
(set grow and shrink to 1, choose ple_window before starting a guest)

And because you have to willingly set 1, I'm not aware of any advantage
of '<= 1'.

> >+static int __shrink_ple_window(int val, int modifier, int minimum)
> >+{
> >+	if (modifier < 1)
> 
> why not modifier < 1. May be this would address a concern in patch 6.

"/= 1" gives no shrinking, which I considered as a feature -- you can
easily search for the maximal achievable ple_window that way.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 18e0e52..a2fa6ba 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -125,14 +125,32 @@  module_param(nested, bool, S_IRUGO);
  * Time is measured based on a counter that runs at the same rate as the TSC,
  * refer SDM volume 3b section 21.6.13 & 22.1.3.
  */
-#define KVM_VMX_DEFAULT_PLE_GAP    128
-#define KVM_VMX_DEFAULT_PLE_WINDOW 4096
+#define KVM_VMX_DEFAULT_PLE_GAP           128
+#define KVM_VMX_DEFAULT_PLE_WINDOW        4096
+#define KVM_VMX_DEFAULT_PLE_WINDOW_GROW   2
+#define KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK 0
+#define KVM_VMX_DEFAULT_PLE_WINDOW_MAX    \
+		INT_MAX / KVM_VMX_DEFAULT_PLE_WINDOW_GROW
+
 static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
 module_param(ple_gap, int, S_IRUGO);
 
 static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
 module_param(ple_window, int, S_IRUGO);
 
+/* Default doubles per-vcpu window every exit. */
+static int ple_window_grow = KVM_VMX_DEFAULT_PLE_WINDOW_GROW;
+module_param(ple_window_grow, int, S_IRUGO);
+
+/* Default resets per-vcpu window every exit to ple_window. */
+static int ple_window_shrink = KVM_VMX_DEFAULT_PLE_WINDOW_SHRINK;
+module_param(ple_window_shrink, int, S_IRUGO);
+
+/* Default is to compute the maximum so we can never overflow. */
+static int ple_window_actual_max = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
+static int ple_window_max        = KVM_VMX_DEFAULT_PLE_WINDOW_MAX;
+module_param(ple_window_max, int, S_IRUGO);
+
 extern const ulong vmx_return;
 
 #define NR_AUTOLOAD_MSRS 8
@@ -5679,12 +5697,73 @@  out:
 	return ret;
 }
 
+static int __grow_ple_window(int val)
+{
+	if (ple_window_grow < 1)
+		return ple_window;
+
+	val = min(val, ple_window_actual_max);
+
+	if (ple_window_grow < ple_window)
+		val *= ple_window_grow;
+	else
+		val += ple_window_grow;
+
+	return val;
+}
+
+static int __shrink_ple_window(int val, int modifier, int minimum)
+{
+	if (modifier < 1)
+		return ple_window;
+
+	if (modifier < ple_window)
+		val /= modifier;
+	else
+		val -= modifier;
+
+	return max(val, minimum);
+}
+
+static void grow_ple_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->ple_window = __grow_ple_window(vmx->ple_window);
+}
+
+static void shrink_ple_window(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->ple_window = __shrink_ple_window(vmx->ple_window,
+	                                      ple_window_shrink, ple_window);
+}
+
+/*
+ * ple_window_actual_max is computed to be one grow_ple_window() below
+ * ple_window_max. (See __grow_ple_window for the reason.)
+ * This prevents overflows, because ple_window_max is int.
+ * ple_window_max effectively rounded down to a multiple of ple_window_grow in
+ * this process.
+ * ple_window_max is also prevented from setting vmx->ple_window < ple_window.
+ */
+static void update_ple_window_actual_max(void)
+{
+	ple_window_actual_max =
+			__shrink_ple_window(max(ple_window_max, ple_window),
+			                    ple_window_grow, INT_MIN);
+}
+
 /*
  * Indicate a busy-waiting vcpu in spinlock. We do not enable the PAUSE
  * exiting, so only get here on cpu with PAUSE-Loop-Exiting.
  */
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
+	if (ple_gap)
+		grow_ple_window(vcpu);
+
 	skip_emulated_instruction(vcpu);
 	kvm_vcpu_on_spin(vcpu);
 
@@ -8854,6 +8933,8 @@  static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 
 void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
+	if (ple_gap)
+		shrink_ple_window(vcpu);
 }
 
 static struct kvm_x86_ops vmx_x86_ops = {
@@ -9077,6 +9158,8 @@  static int __init vmx_init(void)
 	} else
 		kvm_disable_tdp();
 
+	update_ple_window_actual_max();
+
 	return 0;
 
 out7:

[v3,4/7] KVM: VMX: dynamise PLE window

Commit Message

Comments

Patch