
[1/5] KVM: x86: avoid atomic operations on APICv vmentry

Message ID 1476469291-5039-2-git-send-email-pbonzini@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Paolo Bonzini Oct. 14, 2016, 6:21 p.m. UTC
On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
posted interrupts turn out to be slower than interrupt injection via
KVM_REQ_EVENT.

This patch slightly optimizes the IRR update, avoiding expensive atomic
operations in the common case where PI.ON=0 at vmentry or the PIR vector
is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
measured by kvm-unit-tests' inl_from_qemu test (20 runs):

              | enable_apicv=1  |  enable_apicv=0
              | mean     stdev  |  mean     stdev
    ----------|-----------------|------------------
    before    | 5826     32.65  |  5765     47.09
    after     | 5809     43.42  |  5777     77.02

Of course, any change in the right column is just a placebo effect. :)
The savings are bigger if interrupts are frequent.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/lapic.c | 6 ++++--
 arch/x86/kvm/vmx.c   | 9 ++++++++-
 2 files changed, 12 insertions(+), 3 deletions(-)

Comments

Nadav Amit Oct. 14, 2016, 6:50 p.m. UTC | #1
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 23b99f305382..63a442aefc12 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -342,9 +342,11 @@ void __kvm_apic_update_irr(u32 *pir, void *regs)
> 	u32 i, pir_val;
> 
> 	for (i = 0; i <= 7; i++) {
> -		pir_val = xchg(&pir[i], 0);
> -		if (pir_val)
> +		pir_val = READ_ONCE(pir[i]);

Out of curiosity, do you really need this READ_ONCE?

Paolo Bonzini Oct. 14, 2016, 6:56 p.m. UTC | #2
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 23b99f305382..63a442aefc12 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -342,9 +342,11 @@ void __kvm_apic_update_irr(u32 *pir, void *regs)
> > 	u32 i, pir_val;
> > 
> > 	for (i = 0; i <= 7; i++) {
> > -		pir_val = xchg(&pir[i], 0);
> > -		if (pir_val)
> > +		pir_val = READ_ONCE(pir[i]);
> 
> Out of curiosity, do you really need this READ_ONCE?

The answer can only be "depends on the compiler's whims". :)
If you think of READ_ONCE as a C11 relaxed atomic load, then yes.

Paolo
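
A minimal sketch of the equivalence Paolo is drawing, with illustrative
names (the kernel's real READ_ONCE lives in include/linux/compiler.h and
handles more cases):

/* Roughly what READ_ONCE() boils down to for a scalar: a volatile load,
 * which the compiler may not duplicate, fuse, or hoist out of a loop.
 * This is a sketch, not the kernel's exact macro. */
#define READ_ONCE_SKETCH(x)	(*(const volatile typeof(x) *)&(x))

/* The C11 analogue: a relaxed atomic load, usable in userspace code. */
#include <stdatomic.h>

static inline unsigned int pir_word_load(_Atomic unsigned int *p)
{
	return atomic_load_explicit(p, memory_order_relaxed);
}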
Nadav Amit Oct. 14, 2016, 7:44 p.m. UTC | #3
> On Oct 14, 2016, at 11:56 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>> 
>>> 	for (i = 0; i <= 7; i++) {
>>> -		pir_val = xchg(&pir[i], 0);
>>> -		if (pir_val)
>>> +		pir_val = READ_ONCE(pir[i]);
>> 
>> Out of curiosity, do you really need this READ_ONCE?
> 
> The answer can only be "depends on the compiler's whims". :)
> If you think of READ_ONCE as a C11 relaxed atomic load, then yes.

Hm.. So the idea is to make the code "race-free" in the sense
that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?

If that is the case, I think there are many other cases that need to be
changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.


Paolo Bonzini Oct. 15, 2016, 7:47 a.m. UTC | #4
> > On Oct 14, 2016, at 11:56 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> >>> 
> >>> 	for (i = 0; i <= 7; i++) {
> >>> -		pir_val = xchg(&pir[i], 0);
> >>> -		if (pir_val)
> >>> +		pir_val = READ_ONCE(pir[i]);
> >> 
> >> Out of curiosity, do you really need this READ_ONCE?
> > 
> > The answer can only be "depends on the compiler's whims". :)
> > If you think of READ_ONCE as a C11 relaxed atomic load, then yes.
> 
> Hm.. So the idea is to make the code "race-free" in the sense
> that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?
> 
> If that is the case, I think there are many other cases that need to be
> changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.

There is no documentation for this in the kernel tree unfortunately.
But yes, I think we should do that.  Using READ_ONCE/WRITE_ONCE around
memory barriers is a start.

Paolo
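
A sketch of the pattern Paolo describes -- annotating the racing
accesses on both sides of the barriers (variable and function names are
illustrative, not from the patch):

static int data, flag;

/* Producer: publish the data, then set the flag. */
static void publish(void)
{
	WRITE_ONCE(data, 42);
	smp_wmb();		/* order the data store before the flag store */
	WRITE_ONCE(flag, 1);
}

/* Consumer: observe the flag, then read the data. */
static int consume(void)
{
	if (READ_ONCE(flag)) {
		smp_rmb();	/* pairs with the smp_wmb() above */
		return READ_ONCE(data);
	}
	return -1;		/* nothing published yet */
}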
Michael S. Tsirkin Oct. 16, 2016, 2:29 a.m. UTC | #5
On Sat, Oct 15, 2016 at 03:47:45AM -0400, Paolo Bonzini wrote:
> 
> > > [...]
> > 
> > Hm.. So the idea is to make the code "race-free" in the sense
> > that every concurrent memory access is done using READ_ONCE/WRITE_ONCE?
> > 
> > If that is the case, I think there are many other cases that need to be
> > changed, for example apic->irr_pending and vcpu->arch.pv.pv_unhalted.
> 
> There is no documentation for this in the kernel tree unfortunately.
> But yes, I think we should do that.  Using READ_ONCE/WRITE_ONCE around
> memory barriers is a start.
> 
> Paolo

I'm beginning to think that if a value is always (maybe except for init
where we don't much care about the code size anyway) accessed through
*_ONCE macros, we should just mark it volatile and be done with it. The
code will look cleaner, and there will be less room for errors
like forgetting *_ONCE macros.

Would such code (where all accesses would otherwise be done through
READ_ONCE/WRITE_ONCE) be an exception to the
volatile-considered-harmful.txt rules?

Cc Paul and Jonathan (for volatile-considered-harmful.txt).
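
A sketch of the two styles being compared (illustrative struct, not from
the patch):

/* Today's convention: a plain field, with every racy access annotated
 * at each call site -- and easy to get wrong by omission. */
struct desc_plain {
	u32 status;
};
/* reader:	u32 s = READ_ONCE(d->status); */

/* The proposal: mark the field volatile once, so that every plain
 * access behaves like an implicit READ_ONCE/WRITE_ONCE. */
struct desc_volatile {
	volatile u32 status;
};
/* reader:	u32 s = d->status;	-- the compiler must redo the load */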
Paul E. McKenney Oct. 19, 2016, 11:45 a.m. UTC | #6
On Sun, Oct 16, 2016 at 05:29:24AM +0300, Michael S. Tsirkin wrote:
> On Sat, Oct 15, 2016 at 03:47:45AM -0400, Paolo Bonzini wrote:
> > [...]
> 
> I'm beginning to think that if a value is always (maybe except for init
> where we don't much care about the code size anyway) accessed through
> *_ONCE macros, we should just mark it volatile and be done with it. The
> code will look cleaner, and there will be less room for errors
> like forgetting *_ONCE macros.
> 
> Would such code (where all accesses would otherwise be done through
> READ_ONCE/WRITE_ONCE) be an exception to the
> volatile-considered-harmful.txt rules?
> 
> Cc Paul and Jonathan (for volatile-considered-harmful.txt).

One concern would be the guy reading the code, to whom it might not
be obvious that the underlying access was volatile, especially if
the reference was a few levels down.

							Thanx, Paul

Radim Krčmář Oct. 26, 2016, 7:53 p.m. UTC | #7
2016-10-14 20:21+0200, Paolo Bonzini:
> [...]
> ---
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
>  			(unsigned long *)&pi_desc->control);
>  }
>  
> +static inline void pi_clear_on(struct pi_desc *pi_desc)
> +{
> +	clear_bit(POSTED_INTR_ON,
> +  		  (unsigned long *)&pi_desc->control);
   ^^
   bad whitespace.

> +}

We should add an explicit smp_mb__after_atomic() for extra correctness,
because clear_bit() does not guarantee a memory barrier and we must make
sure that pir reads can't be reordered before it.
x86 clear_bit() currently uses a locked instruction, though.

> +
>  static inline int pi_test_on(struct pi_desc *pi_desc)
>  {
>  	return test_bit(POSTED_INTR_ON,

Other than that,

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
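
A sketch of the call site with the explicit barrier Radim suggests
(placement is illustrative; the patch itself does not include it):

	if (!pi_test_on(&vmx->pi_desc))
		return;

	pi_clear_on(&vmx->pi_desc);
	/*
	 * clear_bit() is not guaranteed to imply a memory barrier, so
	 * make sure the PIR reads below cannot be reordered before ON
	 * is cleared.  On current x86 this is a compiler barrier only,
	 * because clear_bit() already uses a locked instruction.
	 */
	smp_mb__after_atomic();
	kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);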
Michael S. Tsirkin Oct. 26, 2016, 9:42 p.m. UTC | #8
On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
> 2016-10-14 20:21+0200, Paolo Bonzini:
> > [...]
> > ---
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
> >  			(unsigned long *)&pi_desc->control);
> >  }
> >  
> > +static inline void pi_clear_on(struct pi_desc *pi_desc)
> > +{
> > +	clear_bit(POSTED_INTR_ON,
> > +  		  (unsigned long *)&pi_desc->control);
>    ^^
>    bad whitespace.
> 
> > +}
> 
> We should add an explicit smp_mb__after_atomic() for extra correctness,
> because clear_bit() does not guarantee a memory barrier and we must make
> sure that pir reads can't be reordered before it.
> x86 clear_bit() currently uses a locked instruction, though.

smp_mb__after_atomic is empty on x86, so it's
a documentation thing, not a correctness thing anyway.

> > +
> >  static inline int pi_test_on(struct pi_desc *pi_desc)
> >  {
> >  	return test_bit(POSTED_INTR_ON,
> 
> Other than that,
> 
> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Michael S. Tsirkin Oct. 26, 2016, 9:50 p.m. UTC | #9
On Wed, Oct 19, 2016 at 04:45:48AM -0700, Paul E. McKenney wrote:
> On Sun, Oct 16, 2016 at 05:29:24AM +0300, Michael S. Tsirkin wrote:
> > [...]
> > 
> > I'm beginning to think that if a value is always (maybe except for init
> > where we don't much care about the code size anyway) accessed through
> > *_ONCE macros, we should just mark it volatile and be done with it. The
> > code will look cleaner, and there will be less room for errors
> > like forgetting *_ONCE macros.
> > 
> > Would such code (where all accesses would otherwise be done through
> > READ_ONCE/WRITE_ONCE) be an exception to the
> > volatile-considered-harmful.txt rules?
> > 
> > Cc Paul and Jonathan (for volatile-considered-harmful.txt).
> 
> One concern would be the guy reading the code, to whom it might not
> be obvious that the underlying access was volatile, especially if
> the reference was a few levels down.
> 
> 							Thanx, Paul

So the guy reading the code will think the access is unsafe, look it up
and realise it's fine.  Is this a big deal?  Isn't it better than the
reverse, where one forgets to use READ_ONCE and gets a runtime bug?

My point is that this text:

	The key point to understand with regard to volatile is that its purpose is
	to suppress optimization, which is almost never what one really wants to do.

doesn't seem to apply anymore, since we have hundreds of users of
READ_ONCE, the point of which is exactly to suppress optimization.

I'm guessing this was written when we only had barrier(), which is quite
unlike READ_ONCE in that it's not tied to a specific memory reference.
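
A sketch of that distinction (illustrative flag variable):

static int done;

/* barrier() is not tied to any object: the compiler must assume all
 * memory may have changed across it, so 'done' gets reloaded, but only
 * as a side effect of the global clobber. */
while (!done)
	barrier();

/* READ_ONCE() constrains exactly one access: it forces a fresh, single
 * load of 'done' on each iteration and affects nothing else. */
while (!READ_ONCE(done))
	cpu_relax();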
Radim Krčmář Oct. 27, 2016, 4:44 p.m. UTC | #10
2016-10-27 00:42+0300, Michael S. Tsirkin:
> On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
>> 2016-10-14 20:21+0200, Paolo Bonzini:
>> > [...]
>> > ---
>> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> > @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
>> >  			(unsigned long *)&pi_desc->control);
>> >  }
>> >  
>> > +static inline void pi_clear_on(struct pi_desc *pi_desc)
>> > +{
>> > +	clear_bit(POSTED_INTR_ON,
>> > +  		  (unsigned long *)&pi_desc->control);
>> > +}
>> 
>> We should add an explicit smp_mb__after_atomic() for extra correctness,
>> because clear_bit() does not guarantee a memory barrier and we must make
>> sure that pir reads can't be reordered before it.
>> x86 clear_bit() currently uses a locked instruction, though.
> 
> smp_mb__after_atomic is empty on x86 so it's
> a documentation thing, not a correctness thing anyway.

All atomics currently contain a barrier, but the code is also
future-proofing, not just documentation: the implementation of clear_bit()
could drop the barrier and smp_mb__after_atomic() would then become a
real barrier.

Adding dma_mb__after_atomic() would be even better as this bug could
happen even on a uniprocessor with an assigned device, but people who
buy an SMP chip to run a UP kernel deserve it.
Michael S. Tsirkin Oct. 27, 2016, 4:51 p.m. UTC | #11
On Thu, Oct 27, 2016 at 06:44:00PM +0200, Radim Krčmář wrote:
> 2016-10-27 00:42+0300, Michael S. Tsirkin:
> > On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
> >> 2016-10-14 20:21+0200, Paolo Bonzini:
> >> > [...]
> >> > ---
> >> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >> > @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
> >> >  			(unsigned long *)&pi_desc->control);
> >> >  }
> >> >  
> >> > +static inline void pi_clear_on(struct pi_desc *pi_desc)
> >> > +{
> >> > +	clear_bit(POSTED_INTR_ON,
> >> > +  		  (unsigned long *)&pi_desc->control);
> >> > +}
> >> 
> >> We should add an explicit smp_mb__after_atomic() for extra correctness,
> >> because clear_bit() does not guarantee a memory barrier and we must make
> >> sure that pir reads can't be reordered before it.
> >> x86 clear_bit() currently uses a locked instruction, though.
> > 
> > smp_mb__after_atomic is empty on x86 so it's
> > a documentation thing, not a correctness thing anyway.
> 
> All atomics currently contain a barrier, but the code is also
> >> future-proofing, not just documentation: the implementation of clear_bit()
> could drop the barrier and smp_mb__after_atomic() would then become a
> real barrier.
> 
> Adding dma_mb__after_atomic() would be even better as this bug could
> happen even on a uniprocessor with an assigned device, but people who
> >> buy an SMP chip to run a UP kernel deserve it.

A UP kernel is not doing DMA, so that does not seem to make sense ...
Why do you need a barrier on a UP kernel?
Radim Krčmář Oct. 27, 2016, 5:06 p.m. UTC | #12
2016-10-27 19:51+0300, Michael S. Tsirkin:
> On Thu, Oct 27, 2016 at 06:44:00PM +0200, Radim Krčmář wrote:
>> 2016-10-27 00:42+0300, Michael S. Tsirkin:
>> > On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
>> >> 2016-10-14 20:21+0200, Paolo Bonzini:
>> >> > [...]
>> >> > ---
>> >> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> >> > @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
>> >> >  			(unsigned long *)&pi_desc->control);
>> >> >  }
>> >> >  
>> >> > +static inline void pi_clear_on(struct pi_desc *pi_desc)
>> >> > +{
>> >> > +	clear_bit(POSTED_INTR_ON,
>> >> > +  		  (unsigned long *)&pi_desc->control);
>> >> > +}
>> >> 
>> >> We should add an explicit smp_mb__after_atomic() for extra correctness,
>> >> because clear_bit() does not guarantee a memory barrier and we must make
>> >> sure that pir reads can't be reordered before it.
>> >> x86 clear_bit() currently uses a locked instruction, though.
>> > 
>> > smp_mb__after_atomic is empty on x86 so it's
>> > a documentation thing, not a correctness thing anyway.
>> 
>> All atomics currently contain a barrier, but the code is also
>> future-proofing, not just documentation: the implementation of clear_bit()
>> could drop the barrier and smp_mb__after_atomic() would then become a
>> real barrier.
>> 
>> Adding dma_mb__after_atomic() would be even better as this bug could
>> happen even on a uniprocessor with an assigned device, but people who
>> buy an SMP chip to run a UP kernel deserve it.
> 
> A UP kernel is not doing DMA, so that does not seem to make sense ...

IOMMU does -- it writes to the PIR and sets ON asynchronously.

> Why do you need a barrier on a UP kernel?

If pi_clear_on() doesn't contain a memory barrier (possible future),
then we have the following race: (pir[0] begins as 0.)

    KVM                           |  IOMMU
   -------------------------------+-------------
   pir_val = READ_ONCE(pir[0])    |
                                  | pir[0] = 123
                                  | pi_set_on()
   pi_clear_on()                  |
   if (pir_val)                   |

READ_ONCE() does not prevent the CPU from prefetching pir[0] (READ_ONCE
does nothing in this patch), so if there was 0 in pir[0] before the
IOMMU wrote to it, then our optimization to avoid the xchg would yield a
false negative and the interrupt would be lost.
Paolo Bonzini Oct. 28, 2016, 9:39 a.m. UTC | #13
On 27/10/2016 19:06, Radim Krčmář wrote:
> 2016-10-27 19:51+0300, Michael S. Tsirkin:
>> On Thu, Oct 27, 2016 at 06:44:00PM +0200, Radim Krčmář wrote:
>>> 2016-10-27 00:42+0300, Michael S. Tsirkin:
>>>> On Wed, Oct 26, 2016 at 09:53:45PM +0200, Radim Krčmář wrote:
>>>>> 2016-10-14 20:21+0200, Paolo Bonzini:
>>>>>> [...]
>>>>>> ---
>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>>>> @@ -521,6 +521,12 @@ static inline void pi_set_sn(struct pi_desc *pi_desc)
>>>>>>  			(unsigned long *)&pi_desc->control);
>>>>>>  }
>>>>>>  
>>>>>> +static inline void pi_clear_on(struct pi_desc *pi_desc)
>>>>>> +{
>>>>>> +	clear_bit(POSTED_INTR_ON,
>>>>>> +  		  (unsigned long *)&pi_desc->control);
>>>>>> +}
>>>>>
>>>>> We should add an explicit smp_mb__after_atomic() for extra correctness,
>>>>> because clear_bit() does not guarantee a memory barrier and we must make
>>>>> sure that pir reads can't be reordered before it.
>>>>> x86 clear_bit() currently uses a locked instruction, though.
>>>>
>>>> smp_mb__after_atomic is empty on x86 so it's
>>>> a documentation thing, not a correctness thing anyway.
>>>
>>> All atomics currently contain a barrier, but the code is also
>>> future-proofing, not just documentation: the implementation of clear_bit()
>>> could drop the barrier and smp_mb__after_atomic() would then become a
>>> real barrier.
>>>
>>> Adding dma_mb__after_atomic() would be even better as this bug could
>>> happen even on a uniprocessor with an assigned device, but people who
>>> buy an SMP chip to run a UP kernel deserve it.
>>
>> A UP kernel is not doing DMA, so that does not seem to make sense ...
> 
> IOMMU does -- it writes to the PIR and sets ON asynchronously.

I can use either __smp_mb__after_atomic or virt_mb__after_atomic.  The
difference is documentation only, since all of them are
compiler-barriers only on x86.

Preferences?

Thanks,

Paolo

Michael S. Tsirkin Oct. 28, 2016, 10:04 p.m. UTC | #14
On Fri, Oct 28, 2016 at 11:39:44AM +0200, Paolo Bonzini wrote:
> 
> 
> On 27/10/2016 19:06, Radim Krčmář wrote:
> > [...]
> > 
> > IOMMU does -- it writes to the PIR and sets ON asynchronously.
> 
> I can use either __smp_mb__after_atomic or virt_mb__after_atomic.  The
> difference is documentation only, since all of them are
> compiler-barriers only on x86.

A comment is also an option.

> Preferences?
> 
> Thanks,
> 
> Paolo

virt_ is for a VM guest.  Please don't use it for host-side code.
I thought that was clear enough, but maybe I should add
more documentation.


Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 23b99f305382..63a442aefc12 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -342,9 +342,11 @@  void __kvm_apic_update_irr(u32 *pir, void *regs)
 	u32 i, pir_val;
 
 	for (i = 0; i <= 7; i++) {
-		pir_val = xchg(&pir[i], 0);
-		if (pir_val)
+		pir_val = READ_ONCE(pir[i]);
+		if (pir_val) {
+			pir_val = xchg(&pir[i], 0);
 			*((u32 *)(regs + APIC_IRR + i * 0x10)) |= pir_val;
+		}
 	}
 }
 EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2577183b40d9..7c79d6c6b6ed 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -521,6 +521,12 @@  static inline void pi_set_sn(struct pi_desc *pi_desc)
 			(unsigned long *)&pi_desc->control);
 }
 
+static inline void pi_clear_on(struct pi_desc *pi_desc)
+{
+	clear_bit(POSTED_INTR_ON,
+  		  (unsigned long *)&pi_desc->control);
+}
+
 static inline int pi_test_on(struct pi_desc *pi_desc)
 {
 	return test_bit(POSTED_INTR_ON,
@@ -4854,9 +4860,10 @@  static void vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
-	if (!pi_test_and_clear_on(&vmx->pi_desc))
+	if (!pi_test_on(&vmx->pi_desc))
 		return;
 
+	pi_clear_on(&vmx->pi_desc);
 	kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
 }