
[v2,1/2] ARM: KVM: Yield CPU when vcpu executes a WFE

Message ID 1381253894-18114-2-git-send-email-marc.zyngier@arm.com (mailing list archive)
State New, archived

Commit Message

Marc Zyngier Oct. 8, 2013, 5:38 p.m. UTC
On an (even slightly) oversubscribed system, spinlocks quickly
become a bottleneck: some vcpus spin waiting for a lock to be
released while the vcpu holding the lock may not be running at
all.

This creates contention, and the observed slowdown is 40x for
hackbench. No, this isn't a typo.

The solution is to trap blocking WFEs and tell KVM that we're
now spinning. This ensures that other vcpus will get a scheduling
boost, allowing the lock to be released more quickly. Also, selecting
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT slightly improves performance
when the VM is severely overcommitted.

Quick test to estimate the performance: hackbench 1 process 1000

2xA15 host (baseline):	1.843s

2xA15 guest w/o patch:	2.083s
4xA15 guest w/o patch:	80.212s
8xA15 guest w/o patch:	Could not be bothered to find out

2xA15 guest w/ patch:	2.102s
4xA15 guest w/ patch:	3.205s
8xA15 guest w/ patch:	6.887s

So we go from a 40x degradation to 1.5x in the 2x overcommit case,
which is vaguely more acceptable.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
---
 arch/arm/include/asm/kvm_arm.h | 4 +++-
 arch/arm/kvm/Kconfig           | 1 +
 arch/arm/kvm/handle_exit.c     | 6 +++++-
 3 files changed, 9 insertions(+), 2 deletions(-)

Comments

Christoffer Dall Oct. 16, 2013, 1:14 a.m. UTC | #1
On Tue, Oct 08, 2013 at 06:38:13PM +0100, Marc Zyngier wrote:
> On an (even slightly) oversubscribed system, spinlocks quickly
> become a bottleneck [...]
>
Patch looks good; I can just apply it and add the other one I just
sent as a reply, if there are no objections.

Sorry for the long turn-around on this one.

-Christoffer
Marc Zyngier Oct. 16, 2013, 7:08 a.m. UTC | #2
On 2013-10-16 02:14, Christoffer Dall wrote:
> On Tue, Oct 08, 2013 at 06:38:13PM +0100, Marc Zyngier wrote:
>> On an (even slightly) oversubscribed system, spinlocks quickly
>> become a bottleneck [...]
>>
> Patch looks good; I can just apply it and add the other one I just
> sent as a reply, if there are no objections.

Yeah, I missed the updated comments on this one, thanks for taking care 
of it.

> Sorry for the long turn-around on this one.

No worries. As long as it goes in, I'm happy. It makes such a
difference on my box that it is absolutely mind-boggling.

Thanks,

          M.
Christoffer Dall Oct. 16, 2013, 4:55 p.m. UTC | #3
On 16 October 2013 00:08, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 2013-10-16 02:14, Christoffer Dall wrote:
>>
>> On Tue, Oct 08, 2013 at 06:38:13PM +0100, Marc Zyngier wrote:
>>>
>>> On an (even slightly) oversubscribed system, spinlocks quickly
>>> become a bottleneck [...]
>>>
>> Patch looks good; I can just apply it and add the other one I just
>> sent as a reply, if there are no objections.
>
>
> Yeah, I missed the updated comments on this one, thanks for taking care of
> it.
>

np.


>
>> Sorry for the long turn-around on this one.
>
>
> No worries. As long as it goes in, I'm happy. It makes such a difference
> on my box that it is absolutely mind-boggling.
>
Applied to kvm-arm-next.

-Christoffer

Patch

diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
index 64e9696..693d5b2 100644
--- a/arch/arm/include/asm/kvm_arm.h
+++ b/arch/arm/include/asm/kvm_arm.h
@@ -67,7 +67,7 @@ 
  */
 #define HCR_GUEST_MASK (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
 			HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
-			HCR_SWIO | HCR_TIDCP)
+			HCR_TWE | HCR_SWIO | HCR_TIDCP)
 #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)
 
 /* System Control Register (SCTLR) bits */
@@ -208,6 +208,8 @@ 
 #define HSR_EC_DABT	(0x24)
 #define HSR_EC_DABT_HYP	(0x25)
 
+#define HSR_WFI_IS_WFE		(1U << 0)
+
 #define HSR_HVC_IMM_MASK	((1UL << 16) - 1)
 
 #define HSR_DABT_S1PTW		(1U << 7)
diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
index ebf5015..466bd29 100644
--- a/arch/arm/kvm/Kconfig
+++ b/arch/arm/kvm/Kconfig
@@ -20,6 +20,7 @@  config KVM
 	bool "Kernel-based Virtual Machine (KVM) support"
 	select PREEMPT_NOTIFIERS
 	select ANON_INODES
+	select HAVE_KVM_CPU_RELAX_INTERCEPT
 	select KVM_MMIO
 	select KVM_ARM_HOST
 	depends on ARM_VIRT_EXT && ARM_LPAE
diff --git a/arch/arm/kvm/handle_exit.c b/arch/arm/kvm/handle_exit.c
index df4c82d..c4c496f 100644
--- a/arch/arm/kvm/handle_exit.c
+++ b/arch/arm/kvm/handle_exit.c
@@ -84,7 +84,11 @@  static int handle_dabt_hyp(struct kvm_vcpu *vcpu, struct kvm_run *run)
 static int kvm_handle_wfi(struct kvm_vcpu *vcpu, struct kvm_run *run)
 {
 	trace_kvm_wfi(*vcpu_pc(vcpu));
-	kvm_vcpu_block(vcpu);
+	if (kvm_vcpu_get_hsr(vcpu) & HSR_WFI_IS_WFE)
+		kvm_vcpu_on_spin(vcpu);
+	else
+		kvm_vcpu_block(vcpu);
+
 	return 1;
 }