
[RFC] KVM: arm/arm64: Enable direct irqfd MSI injection

Message ID 1552833373-19828-1-git-send-email-yuzenghui@huawei.com (mailing list archive)
State New, archived
Series [RFC] KVM: arm/arm64: Enable direct irqfd MSI injection

Commit Message

Zenghui Yu March 17, 2019, 2:36 p.m. UTC
Currently, IRQFD on arm still uses the deferred workqueue mechanism
to inject interrupts into the guest, which will likely lead to busy
context-switching from/to the kworker thread. This overhead serves no
purpose (only in my view ...) and will result in degraded interrupt
performance.

Implement kvm_arch_set_irq_inatomic() for arm/arm64 to support direct
irqfd MSI injection, by which we can get rid of this latency. As a
result, MSI-intensive irqfd scenarios (e.g., DPDK with high
packet-processing workloads) will benefit from it.

Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
---

It seems that only MSI will follow the IRQFD path; did I miss something?

This patch is still under test and sent out for early feedback. If I have
any misunderstanding, please correct me and let me know. Thanks!

---
 virt/kvm/arm/vgic/trace.h      | 22 ++++++++++++++++++++++
 virt/kvm/arm/vgic/vgic-irqfd.c | 21 +++++++++++++++++++++
 2 files changed, 43 insertions(+)
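
For context, the fast path being implemented here is invoked from the
irqfd wakeup handler in virt/kvm/eventfd.c, which falls back to the
workqueue (and hence the kworker context switch) only when atomic
injection fails. A simplified sketch of that dispatch (details elided;
see virt/kvm/eventfd.c for the real thing):

	static int irqfd_wakeup(wait_queue_entry_t *wait, unsigned mode,
				int sync, void *key)
	{
		struct kvm_kernel_irqfd *irqfd =
			container_of(wait, struct kvm_kernel_irqfd, wait);
		struct kvm_kernel_irq_routing_entry irq;
		struct kvm *kvm = irqfd->kvm;

		/* ... snapshot the routing entry under SRCU ... */

		/*
		 * Try to inject directly from this atomic (wait-queue)
		 * context; defer to the kworker only on -EWOULDBLOCK.
		 */
		if (kvm_arch_set_irq_inatomic(&irq, kvm,
					      KVM_USERSPACE_IRQ_SOURCE_ID, 1,
					      false) == -EWOULDBLOCK)
			schedule_work(&irqfd->inject);

		return 0;
	}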

Comments

KarimAllah Ahmed March 17, 2019, 2:50 p.m. UTC | #1
On Sun, 2019-03-17 at 14:36 +0000, Zenghui Yu wrote:
> [...]
> +/**
> + * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
> + *
> + * Currently only direct MSI injection is supported.
> + */
> +int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
> +			      struct kvm *kvm, int irq_source_id, int level,
> +			      bool line_status)
> +{
> +	int ret;
> +
> +	trace_kvm_arch_set_irq_inatomic(e->gsi, e->type, level, irq_source_id);
> +
> +	if (unlikely(e->type != KVM_IRQ_ROUTING_MSI))
> +		return -EWOULDBLOCK;
> +
> +	ret = kvm_set_msi(e, kvm, irq_source_id, level, line_status);

The implementation of kvm_set_msi is not atomic. There is a mutex held in one
of the execution paths. That is why it cannot be used directly in this atomic
context.
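
For reference, the path in question looks roughly like this (abridged
from virt/kvm/arm/vgic/vgic-its.c of this era; a sketch, not the exact
code):

	int vgic_its_inject_msi(struct kvm *kvm, struct kvm_msi *msi)
	{
		struct vgic_its *its;
		int ret;

		its = vgic_msi_to_its(kvm, msi);
		if (IS_ERR(its))
			return PTR_ERR(its);

		/* A sleeping lock: unsafe in the irqfd atomic context. */
		mutex_lock(&its->its_lock);
		ret = vgic_its_trigger_msi(kvm, its, msi->devid, msi->data);
		mutex_unlock(&its->its_lock);

		return ret;
	}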

> +	return ret;
> +}
> +
>  int kvm_vgic_setup_default_irq_routing(struct kvm *kvm)
>  {
>  	struct kvm_irq_routing_entry *entries;



Eric Auger March 17, 2019, 6:05 p.m. UTC | #2
Hi,

On 3/17/19 3:50 PM, Raslan, KarimAllah wrote:
> On Sun, 2019-03-17 at 14:36 +0000, Zenghui Yu wrote:
>> [...]
>> +/**
>> + * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
>> + *
>> + * Currently only direct MSI injection is supported.
>> + */
>> +int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
>> +			      struct kvm *kvm, int irq_source_id, int level,
>> +			      bool line_status)
>> +{
>> +	int ret;
>> +
>> +	trace_kvm_arch_set_irq_inatomic(e->gsi, e->type, level, irq_source_id);
>> +
>> +	if (unlikely(e->type != KVM_IRQ_ROUTING_MSI))
>> +		return -EWOULDBLOCK;
>> +
>> +	ret = kvm_set_msi(e, kvm, irq_source_id, level, line_status);
> 
> The implementation of kvm_set_msi is not atomic. There is a mutex held in one
> of the execution paths. That is why it cannot be used directly in this atomic
> context.
Yep, the vgic_its_trigger_msi() call is protected by its->its_lock.

Thanks

Eric
Marc Zyngier March 17, 2019, 7:35 p.m. UTC | #3
On Sun, 17 Mar 2019 14:36:13 +0000,
Zenghui Yu <yuzenghui@huawei.com> wrote:
> 
> Currently, IRQFD on arm still uses the deferred workqueue mechanism
> to inject interrupts into the guest, which will likely lead to busy
> context-switching from/to the kworker thread. This overhead serves no
> purpose (only in my view ...) and will result in degraded interrupt
> performance.
> 
> Implement kvm_arch_set_irq_inatomic() for arm/arm64 to support direct
> irqfd MSI injection, by which we can get rid of this latency. As a
> result, MSI-intensive irqfd scenarios (e.g., DPDK with high
> packet-processing workloads) will benefit from it.
> 
> Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
> ---
> 
> It seems that only MSI will follow the IRQFD path; did I miss something?
> 
> This patch is still under test and sent out for early feedback. If I have
> any misunderstanding, please correct me and let me know. Thanks!

As mentioned by other folks in the thread, this is clearly wrong. The
first thing vgic_its_inject_msi does is to lock the corresponding ITS
using a mutex. So the "no purpose" bit was a bit too quick.

When doing this kind of work, I suggest you enable lockdep and all the
related checkers. Also, for any optimisation, please post actual
numbers for the relevant benchmarks. Saying "application X will
benefit from it" is meaningless without any actual data.

> [...]
> +/**
> + * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
> + *
> + * Currently only direct MSI injection is supported.
> + */
> +int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
> +			      struct kvm *kvm, int irq_source_id, int level,
> +			      bool line_status)
> +{
> +	int ret;
> +
> +	trace_kvm_arch_set_irq_inatomic(e->gsi, e->type, level, irq_source_id);
> +
> +	if (unlikely(e->type != KVM_IRQ_ROUTING_MSI))
> +		return -EWOULDBLOCK;
> +
> +	ret = kvm_set_msi(e, kvm, irq_source_id, level, line_status);
> +	return ret;
> +}
> +

Although we've established that the approach is wrong, maybe we can
look at improving this aspect.

A first approach would be to keep a small cache of the last few
successful translations for this ITS, a cache that could be looked up
while holding a spinlock instead. A hit in this cache could be injected
directly. Any command that invalidates or changes anything (DISCARD,
INV, INVALL, MAPC with V=0, MAPD with V=0, MOVALL, MOVI) should nuke
the cache altogether.

Of course, all of that needs to be quantified.
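
To make the idea concrete, such a cache could look roughly like this
(a sketch with hypothetical names and no eviction policy shown, not
the actual prototype):

	struct vgic_translation_cache_entry {
		struct list_head	entry;
		phys_addr_t		db;	/* ITS doorbell address */
		u32			devid;
		u32			eventid;
		struct vgic_irq		*irq;	/* cached result */
	};

	/*
	 * Fast-path lookup: only a spinlock is taken, so this is safe
	 * to call from kvm_arch_set_irq_inatomic().
	 */
	static struct vgic_irq *vgic_its_check_cache(struct kvm *kvm,
						     phys_addr_t db,
						     u32 devid, u32 eventid)
	{
		struct vgic_dist *dist = &kvm->arch.vgic;
		struct vgic_translation_cache_entry *cte;
		struct vgic_irq *irq = NULL;
		unsigned long flags;

		raw_spin_lock_irqsave(&dist->lpi_list_lock, flags);
		list_for_each_entry(cte, &dist->lpi_translation_cache, entry) {
			if (cte->db == db && cte->devid == devid &&
			    cte->eventid == eventid) {
				irq = cte->irq;
				break;
			}
		}
		raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags);

		return irq;
	}

A miss (or any of the commands above) would still take the slow path;
the cache is only a shortcut for already-translated (devid, eventid)
pairs.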

Thanks,

	M.
Marc Zyngier March 18, 2019, 1:30 p.m. UTC | #4
On Sun, 17 Mar 2019 19:35:48 +0000
Marc Zyngier <marc.zyngier@arm.com> wrote:

[...]

> A first approach would be to keep a small cache of the last few
> successful translations for this ITS, cache that could be looked-up by
> holding a spinlock instead. A hit in this cache could directly be
> injected. Any command that invalidates or changes anything (DISCARD,
> INV, INVALL, MAPC with V=0, MAPD with V=0, MOVALL, MOVI) should nuke
> the cache altogether.

And to explain what I meant with this, I've pushed a branch[1] with a
basic prototype. It is good enough to get a VM to boot, but I wouldn't
trust it for anything serious just yet.

If anyone feels like giving it a go and checking whether it has any
performance benefit, please do so.

Thanks,

	M.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/its-translation-cache
Zenghui Yu March 19, 2019, 1:09 a.m. UTC | #5
Hi all,

On 2019/3/18 3:35, Marc Zyngier wrote:
> On Sun, 17 Mar 2019 14:36:13 +0000,
> Zenghui Yu <yuzenghui@huawei.com> wrote:
>> [...]
> 
> As mentioned by other folks in the thread, this is clearly wrong. The
> first thing vgic_its_inject_msi does is to lock the corresponding ITS using
> a mutex. So the "no purpose" bit was a bit too quick.
> 
> When doing this kind of work, I suggest you enable lockdep and all the
> related checkers. Also, for any optimisation, please post actual
> numbers for the relevant benchmarks. Saying "application X will
> benefit from it" is meaningless without any actual data.
> 
>> [...]
> 
> Although we've established that the approach is wrong, maybe we can
> look at improving this aspect.
> 
> A first approach would be to keep a small cache of the last few
> successful translations for this ITS, a cache that could be looked up
> while holding a spinlock instead. A hit in this cache could be injected
> directly. Any command that invalidates or changes anything (DISCARD,
> INV, INVALL, MAPC with V=0, MAPD with V=0, MOVALL, MOVI) should nuke
> the cache altogether.
> 
> Of course, all of that needs to be quantified.

Thanks for all of your explanations, especially for Marc's suggestions!
It took me a long time to figure out my mistakes, since I am not very
familiar with the locking stuff. Now I have to apologize for my noise.

As for the its-translation-cache code (really good news for us), we
have taken a rough look at it and are starting to test it now!


thanks,

zenghui

Marc Zyngier March 19, 2019, 10:01 a.m. UTC | #6
On Tue, 19 Mar 2019 09:09:43 +0800
Zenghui Yu <yuzenghui@huawei.com> wrote:

> Hi all,
> 
> On 2019/3/18 3:35, Marc Zyngier wrote:
> > [...]
> 
> Thanks for all of your explanations, especially for Marc's suggestions!
> It took me a long time to figure out my mistakes, since I am not very
> familiar with the locking stuff. Now I have to apologize for my noise.

No need to apologize. The whole point of this list is to have
discussions. Although your approach wasn't working, you did
identify potential room for improvement.

> As for the its-translation-cache code (really good news for us), we
> have taken a rough look at it and are starting to test it now!

Please let me know about your findings. My initial test doesn't show
any improvement, but that could easily be attributed to the system I'm
running this on (a tiny and slightly broken dual A53 system). The sizing
of the cache is also important: too small, and you have the overhead of
the lookup for no benefit; too big, and you waste memory.

Having thought about it a bit more, I think we can drop the
invalidation on MOVI/MOVALL, as the LPI is still perfectly valid, and
we don't cache the target vcpu. On the other hand, the cache must be
nuked when the ITS is turned off.

Thanks,

	M.
Zenghui Yu March 19, 2019, 3:59 p.m. UTC | #7
Hi Marc,

On 2019/3/19 18:01, Marc Zyngier wrote:
> On Tue, 19 Mar 2019 09:09:43 +0800
> Zenghui Yu <yuzenghui@huawei.com> wrote:
> [...]
> Please let me know about your findings. My initial test doesn't show
> any improvement, but that could easily be attributed to the system I'm
> running this on (a tiny and slightly broken dual A53 system). The sizing
> of the cache is also important: too small, and you have the overhead of
> the lookup for no benefit; too big, and you waste memory.

Not as smooth as expected. With the below config (in the form of XML):

---8<---
     <interface type='vhostuser'>
       <source type='unix' path='/var/run/vhost-user/tap_0' mode='client'/>
       <model type='virtio'/>
       <driver name='vhost' queues='32' vringbuf='4096'/>
     </interface>
---8<---

VM can't even get to boot successfully!


Kernel version is -stable 4.19.28. And *dmesg* on host shows:

---8<---
[  507.908330] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  507.908338] rcu:     35-...0: (0 ticks this GP) idle=d06/1/0x4000000000000000 softirq=72150/72150 fqs=6269
[  507.908341] rcu:     41-...0: (0 ticks this GP) idle=dee/1/0x4000000000000000 softirq=68144/68144 fqs=6269
[  507.908342] rcu:     (detected by 23, t=15002 jiffies, g=68929, q=408641)
[  507.908350] Task dump for CPU 35:
[  507.908351] qemu-kvm        R  running task        0 66789      1 0x00000002
[  507.908354] Call trace:
[  507.908360]  __switch_to+0x94/0xe8
[  507.908363]  _cond_resched+0x24/0x68
[  507.908366]  __flush_work+0x58/0x280
[  507.908369]  free_unref_page_commit+0xc4/0x198
[  507.908370]  free_unref_page+0x84/0xa0
[  507.908371]  __free_pages+0x58/0x68
[  507.908372]  free_pages.part.21+0x34/0x40
[  507.908373]  free_pages+0x2c/0x38
[  507.908375]  poll_freewait+0xa8/0xd0
[  507.908377]  do_sys_poll+0x3d0/0x560
[  507.908378]  __arm64_sys_ppoll+0x180/0x1e8
[  507.908380]  0xa48990
[  507.908381] Task dump for CPU 41:
[  507.908382] kworker/41:1    R  running task        0   647      2 0x0000002a
[  507.908387] Workqueue: events irqfd_inject
[  507.908389] Call trace:
[  507.908391]  __switch_to+0x94/0xe8
[  507.908392]  0x200000131
[... ...]
[  687.928330] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  687.928339] rcu:     35-...0: (0 ticks this GP) idle=d06/1/0x4000000000000000 softirq=72150/72150 fqs=25034
[  687.928341] rcu:     41-...0: (0 ticks this GP) idle=dee/1/0x4000000000000000 softirq=68144/68144 fqs=25034
[  687.928343] rcu:     (detected by 16, t=60007 jiffies, g=68929, q=1601093)
[  687.928351] Task dump for CPU 35:
[  687.928352] qemu-kvm        R  running task        0 66789      1 0x00000002
[  687.928355] Call trace:
[  687.928360]  __switch_to+0x94/0xe8
[  687.928364]  _cond_resched+0x24/0x68
[  687.928367]  __flush_work+0x58/0x280
[  687.928369]  free_unref_page_commit+0xc4/0x198
[  687.928370]  free_unref_page+0x84/0xa0
[  687.928372]  __free_pages+0x58/0x68
[  687.928373]  free_pages.part.21+0x34/0x40
[  687.928374]  free_pages+0x2c/0x38
[  687.928376]  poll_freewait+0xa8/0xd0
[  687.928378]  do_sys_poll+0x3d0/0x560
[  687.928379]  __arm64_sys_ppoll+0x180/0x1e8
[  687.928381]  0xa48990
[  687.928382] Task dump for CPU 41:
[  687.928383] kworker/41:1    R  running task        0   647      2 0x0000002a
[  687.928389] Workqueue: events irqfd_inject
[  687.928391] Call trace:
[  687.928392]  __switch_to+0x94/0xe8
[  687.928394]  0x200000131
[...]
---8<---   endlessly ...

It seems that we've run into some locking-related issues. Any
suggestions for debugging?

Could you also provide your test steps, so that I can run some tests
on my hardware and hopefully see the improvement?

> 
> Having thought about it a bit more, I think we can drop the
> invalidation on MOVI/MOVALL, as the LPI is still perfectly valid, and
> we don't cache the target vcpu. On the other hand, the cache must be
> nuked when the ITS is turned off.

All of these are valuable, but it might be too early for me to consider
them (I have to get the above problem solved first ...)


thanks,

zenghui

Marc Zyngier March 19, 2019, 4:57 p.m. UTC | #8
On Tue, 19 Mar 2019 15:59:00 +0000,
Zenghui Yu <yuzenghui@huawei.com> wrote:
> 
> [...]
> Not as smooth as expected. With the below config (in the form of XML):

The good news is that nothing was expected at all.

> ---8<---
>     <interface type='vhostuser'>
>       <source type='unix' path='/var/run/vhost-user/tap_0' mode='client'/>
>       <model type='virtio'/>
>       <driver name='vhost' queues='32' vringbuf='4096'/>
>     </interface>
> ---8<---

Sorry, I don't read XML, and I have zero idea what this represents.

> 
> VM can't even get to boot successfully!
> 
> 
> Kernel version is -stable 4.19.28. And *dmesg* on host shows:

Please don't test on anything other than mainline. The only thing I'm
interested in at the moment is 5.1-rc1.

> 
> [...]
> 
> It seems that we've run into some locking-related issues. Any
> suggestions for debugging?

None at the moment. And this doesn't seem quite related to the problem
at hand, does it?

> Could you also provide your test steps, so that I can run some tests
> on my hardware and hopefully see the improvement?

Here you go:

qemu-system-aarch64 -m 512M -smp 2 -cpu host,aarch64=on -machine virt,accel=kvm,gic_version=3,its -nographic -drive if=pflash,format=raw,readonly,file=/usr/share/AAVMF/AAVMF_CODE.fd -drive if=pflash,format=raw,file=buster/GXnkZdHqG4e7o4pC.fd -netdev tap,fds=128:129,id=hostnet0,vhost=on,vhostfds=130:131 -device virtio-net-pci,mac=5a:fe:00:e5:b1:30,netdev=hostnet0,mq=on,vectors=6 -drive if=none,format=raw,file=buster/GXnkZdHqG4e7o4pC.img,id=disk0 -device virtio-blk-pci,drive=disk0 -drive file=debian-testing-arm64-DVD-1-preseed.iso,id=cdrom,if=none,media=cdrom -device virtio-scsi-pci -device scsi-cd,drive=cdrom 128<>/dev/tap7 129<>/dev/tap7 130<>/dev/vhost-net 131<>/dev/vhost-net

> > Having thought about it a bit more, I think we can drop the
> > invalidation on MOVI/MOVALL, as the LPI is still perfectly valid, and
> > we don't cache the target vcpu. On the other hand, the cache must be
> > nuked when the ITS is turned off.
> 
> All of these are valuable, but it might be too early for me to consider
> them (I have to get the above problem solved first ...)

I'm not asking you to consider them. I jumped in this thread
explaining what could be done instead. These are ideas on top of what
I've already offered.

     M.
Andre Przywara May 15, 2019, 4:38 p.m. UTC | #9
On Mon, 18 Mar 2019 13:30:40 +0000
Marc Zyngier <marc.zyngier@arm.com> wrote:

Hi,

> On Sun, 17 Mar 2019 19:35:48 +0000
> Marc Zyngier <marc.zyngier@arm.com> wrote:
> 
> [...]
> 
> And to explain what I meant with this, I've pushed a branch[1] with a
> basic prototype. It is good enough to get a VM to boot, but I wouldn't
> trust it for anything serious just yet.
> 
> If anyone feels like giving it a go and checking whether it has any
> performance benefit, please do so.

So I took a stab at the performance aspect, and it took me a while to find
something where it actually makes a difference. The trick is to create *a
lot* of interrupts. This is my setup now:
- GICv3 and ITS
- 5.1.0 kernel vs. 5.1.0 plus Marc's rebased "ITS cache" patches on top
- 4 VCPU guest on a 4 core machine
- passing through a M.2 NVMe SSD (or a USB3 controller) to the guest
- running FIO in the guest, with:
  - 4K block size, random reads, queue depth 16, 4 jobs (small)
  - 1M block size, sequential reads, QD 1, 1 job (big)
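
For reproducibility, the "small" job above corresponds to an fio
invocation along these lines (reconstructed from the description, not
Andre's exact command; /dev/nvme0n1 stands in for the passed-through
device):

  fio --name=small --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio \
      --rw=randread --bs=4k --iodepth=16 --numjobs=4 \
      --runtime=60 --time_based --group_reporting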

For the NVMe disk I see a whopping 19% performance improvement with Marc's
series (for the small blocks). For a SATA SSD connected via USB3.0 I still
see a 6% improvement. For NVMe there were 50,000 interrupts per second on
the host; the USB3 setup only came up to 10,000/s. For big blocks (with
IRQs in the low thousands/s) the win is less, but still a measurable 3%.

Now that I have the setup, I can rerun experiments very quickly (given I
don't lose access to the machine), so let me know if someone needs
further tests.

Cheers,
Andre.

> [1] https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/its-translation-cache
Marc Zyngier May 16, 2019, 7:21 a.m. UTC | #10
Hi Andre,

On Wed, 15 May 2019 17:38:32 +0100,
Andre Przywara <andre.przywara@arm.com> wrote:
> 
> [...]
> 
> So I took a stab at the performance aspect, and it took me a while to find
> something where it actually makes a difference. The trick is to create *a
> lot* of interrupts. This is my setup now:
> - GICv3 and ITS
> - 5.1.0 kernel vs. 5.1.0 plus Marc's rebased "ITS cache" patches on top
> - 4 VCPU guest on a 4 core machine
> - passing through a M.2 NVMe SSD (or a USB3 controller) to the guest
> - running FIO in the guest, with:
>   - 4K block size, random reads, queue depth 16, 4 jobs (small)
>   - 1M block size, sequential reads, QD 1, 1 job (big)
> 
> For the NVMe disk I see a whopping 19% performance improvement with Marc's
> series (for the small blocks). For a SATA SSD connected via USB3.0 I still
> see a 6% improvement. For NVMe there were 50,000 interrupts per second on
> the host; the USB3 setup only came up to 10,000/s. For big blocks (with
> IRQs in the low thousands/s) the win is less, but still a measurable 3%.

Thanks for having a go at this, and identifying the case where it
actually matters (I would have hoped that the original reporter would
have helped with this, but hey, never mind). The results are pretty
impressive (more so than I anticipated), and I wonder whether we could
improve things further (50k interrupts/s is not that high -- I get
more than 100k on some machines just by playing with their sdcard...).

Could you describe how many interrupt sources each device has? The
reason I'm asking is that the cache size is pretty much hardcoded at
the moment (4 entries per vcpu), and that could have an impact on
performance if we keep evicting entries in the cache (note to self:
add some statistics for that).

Another area where we can improve things is that I think the
invalidation mechanism is pretty trigger happy (MOVI really doesn't
need to invalidate the cache). On the other hand, I'm not sure your
guest does too much of that.

Finally, the single cache spin-lock is bound to be a bottleneck of its
own at high interrupt rates, and I wonder whether we should move the
whole thing over to an RCU friendly data structure (the vgic_irq
structure really isn't that friendly). It'd be good to find out how
contended that spinlock is on your system.

> Now that I have the setup, I can rerun experiments very quickly (given I
> don't lose access to the machine), so let me know if someone needs
> further tests.

Another useful data point would be the delta with bare-metal: how much
overhead do we have with KVM, with and without this patch series. Oh,
and for easier comparison, please write it as a table that we can dump
in the cover letter when I actually post the series! ;-)

Thanks,

	M.
Zenghui Yu May 20, 2019, 3:31 p.m. UTC | #11
Hi Marc,

On 2019/5/16 15:21, Marc Zyngier wrote:
> Hi Andre,
> 
> [...]
> 
> Thanks for having a go at this, and identifying the case where it
> actually matters (I would have hoped that the original reporter would
> have helped with this, but hey, never mind). The results are pretty
> impressive (more so than I anticipated), and I wonder whether we could
> improve things further (50k interrupts/s is not that high -- I get
> more than 100k on some machines just by playing with their sdcard...).

I think the "original reporter" must feel embarrassed now.
Actually, we had tested your patches (based on about 5.1.0-rc2) but
failed to see performance improvement. And I stopped to move on, and
then two months had gone... Oh sorry!

We retested your patches on 5.1.0; the results are below.

Test setup:
- GICv3 and ITS (on Taishan 2280, D05)
- two 4-VCPU guests with vhost-net interface
- run iperf in guests:
    - guest1: iperf -s
    - guest2: iperf -c guest1-IP -t 10
- pin vcpu threads and vhost threads on the same NUMA node

Result:
+-----------------+--------------+-----------------------+
| Result          | interrupts/s | bandwidth (Gbits/sec) |
+-----------------+--------------+-----------------------+
| 5.1.0           |    25+ k     |    10.6 Gbits/sec     |
+-----------------+--------------+-----------------------+
| 5.1.0 (patched) |    40+ k     |    10.2 Gbits/sec     |
+-----------------+--------------+-----------------------+

We get "interrupts/s" from /proc/interrupts on iperf server, with stable
measured results. And we get "bandwidth" directly from iperf, but the
results are somewhat *instable*. And the results really confused me --
we received more interrupts but get a slight lower performance, why?

We configure the vhost-net interface with only one queue, so I think we
can rule out the spin-lock influence. And 'perf lock' confirmed this.
This is all that I can provide now, sorry if it's useless.

Also, one minor nit in the code:
In vgic_its_cache_translation(), we use vgic_put_irq() to evict the LRU
cache entry while we're already holding the lpi_list_lock. This will
cause a deadlock, but it is easy to fix.
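
In other words, the pattern is (an illustrative sketch, not the actual
code from the branch):

	raw_spin_lock_irqsave(&dist->lpi_list_lock, flags);
	...
	/*
	 * If this drops the last reference, vgic_put_irq() will try to
	 * take dist->lpi_list_lock again in order to unlink the LPI:
	 * an instant self-deadlock on a non-reentrant spinlock.
	 */
	vgic_put_irq(kvm, irq);
	...
	raw_spin_unlock_irqrestore(&dist->lpi_list_lock, flags);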


Anyway, we always have enough environments (e.g., D05, D06, ...) to do
some tests. If you want to do further tests on our boards, please let me
know :)


thanks,
zenghui

KarimAllah Ahmed May 20, 2019, 6 p.m. UTC | #12
On Mon, 2019-05-20 at 23:31 +0800, Zenghui Yu wrote:
> [...]
> We retested your patches on 5.1.0; the results are below.
> 
> Test setup:
> - GICv3 and ITS (on Taishan 2280, D05)
> - two 4-VCPU guests with vhost-net interface
> - run iperf in guests:
>     - guest1: iperf -s
>     - guest2: iperf -c guest1-IP -t 10
> - pin vcpu threads and vhost threads on the same NUMA node
> 
> Result:
> +-----------------+--------------+-----------------------+
> | Result          | interrupts/s | bandwidth (Gbits/sec) |
> +-----------------+--------------+-----------------------+
> | 5.1.0           |    25+ k     |    10.6 Gbits/sec     |
> +-----------------+--------------+-----------------------+
> | 5.1.0 (patched) |    40+ k     |    10.2 Gbits/sec     |
> +-----------------+--------------+-----------------------+
> 
> We get "interrupts/s" from /proc/interrupts on iperf server, with stable
> measured results. And we get "bandwidth" directly from iperf, but the
> results are somewhat *instable*. And the results really confused me --
> we received more interrupts but get a slight lower performance, why?
> 
> We configure the vhost-net interface with only one queue, so I think we
> can rule out the spin-lock influence. And 'perf lock' confirmed this.
> This is all that I can provide now, sorry if it's useless.
> 
> Also, one minor nit in the code:
> In vgic_its_cache_translation(), we use vgic_put_irq() to evict the LRU
> cache entry while we're already holding the lpi_list_lock. This will
> cause a deadlock, but it is easy to fix.
> 
> 
> Anyway, we always have enough environments (e.g., D05, D06, ...) to do
> some tests. If you want to do further tests on our boards, please let me
> know :)

You actually need a bit more control over the interrupt pinning in the
guest and on the host. You also need to control vCPU pinning on pCPUs
to have a deterministic benchmark here.

I have a patch (not as polished as the one from Marc) that does direct
interrupt injection, and we see roughly a 20%-25% bandwidth increase. So
yes, direct interrupt injection is absolutely needed.

Generally, if the host interrupt hits a CPU different from the ones
running the guest, you see higher bandwidth (with current vanilla KVM).
Once the host interrupts hit the same CPU that runs the guest's vCPU,
not having this direct injection path shows a huge drop in bandwidth.

So generally I would suggest moving forward with the direct injection
patch, as it is really needed on platforms that do not have "posted
interrupts".


Patch

diff --git a/virt/kvm/arm/vgic/trace.h b/virt/kvm/arm/vgic/trace.h
index 55fed77..bc1f4db 100644
--- a/virt/kvm/arm/vgic/trace.h
+++ b/virt/kvm/arm/vgic/trace.h
@@ -27,6 +27,28 @@ 
 		  __entry->vcpu_id, __entry->irq, __entry->level)
 );
 
+TRACE_EVENT(kvm_arch_set_irq_inatomic,
+	TP_PROTO(u32 gsi, u32 type, int level, int irq_source_id),
+	TP_ARGS(gsi, type, level, irq_source_id),
+
+	TP_STRUCT__entry(
+		__field(	u32,	gsi		)
+		__field(	u32,	type		)
+		__field(	int,	level		)
+		__field(	int,	irq_source_id	)
+	),
+
+	TP_fast_assign(
+		__entry->gsi		= gsi;
+		__entry->type		= type;
+		__entry->level		= level;
+		__entry->irq_source_id	= irq_source_id;
+	),
+
+	TP_printk("gsi %u type %u level %d source %d", __entry->gsi,
+		  __entry->type, __entry->level, __entry->irq_source_id)
+);
+
 #endif /* _TRACE_VGIC_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/virt/kvm/arm/vgic/vgic-irqfd.c b/virt/kvm/arm/vgic/vgic-irqfd.c
index 99e026d..4cfc3f4 100644
--- a/virt/kvm/arm/vgic/vgic-irqfd.c
+++ b/virt/kvm/arm/vgic/vgic-irqfd.c
@@ -19,6 +19,7 @@ 
 #include <trace/events/kvm.h>
 #include <kvm/arm_vgic.h>
 #include "vgic.h"
+#include "trace.h"
 
 /**
  * vgic_irqfd_set_irq: inject the IRQ corresponding to the
@@ -105,6 +106,26 @@  int kvm_set_msi(struct kvm_kernel_irq_routing_entry *e,
 	return vgic_its_inject_msi(kvm, &msi);
 }
 
+/**
+ * kvm_arch_set_irq_inatomic: fast-path for irqfd injection
+ *
+ * Currently only direct MSI injection is supported.
+ */
+int kvm_arch_set_irq_inatomic(struct kvm_kernel_irq_routing_entry *e,
+			      struct kvm *kvm, int irq_source_id, int level,
+			      bool line_status)
+{
+	int ret;
+
+	trace_kvm_arch_set_irq_inatomic(e->gsi, e->type, level, irq_source_id);
+
+	if (unlikely(e->type != KVM_IRQ_ROUTING_MSI))
+		return -EWOULDBLOCK;
+
+	ret = kvm_set_msi(e, kvm, irq_source_id, level, line_status);
+	return ret;
+}
+
 int kvm_vgic_setup_default_irq_routing(struct kvm *kvm)
 {
 	struct kvm_irq_routing_entry *entries;