
KVM: x86: Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ_NMI)

Message ID 20230927040939.342643-1-mizhang@google.com (mailing list archive)
State New, archived
Series KVM: x86: Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ_NMI)

Commit Message

Mingwei Zhang Sept. 27, 2023, 4:09 a.m. UTC
Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ_PMI).
When vPMU is in active use, processing each KVM_REQ_PMI will generate a
KVM_REQ_NMI. With the existing control flow, the NMI request raised while
handling KVM_REQ_PMI fails the guest entry, jumps to
kvm_x86_cancel_injection(), and re-enters vcpu_enter_guest(); this wastes a
lot of cycles and increases the overhead of vPMU as well as virtualization.

So move the kvm_check_request(KVM_REQ_NMI) snippet to make the KVM run
loop more efficient with vPMU.
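
As a simplified illustration (a paraphrased call chain, not the literal
code in x86.c/lapic.c, so details may differ), a KVM_REQ_PMI turns into a
KVM_REQ_NMI when the guest programs its LVTPC entry for NMI delivery:

    if (kvm_check_request(KVM_REQ_PMI, vcpu))
            kvm_pmu_deliver_pmi(vcpu);
            /* -> kvm_apic_local_deliver(apic, APIC_LVTPC)           */
            /*    (LVTPC programmed with delivery mode NMI)          */
            /*    -> __apic_accept_irq(..., APIC_DM_NMI, ...)        */
            /*       -> kvm_inject_nmi(vcpu)                         */
            /*          -> kvm_make_request(KVM_REQ_NMI, vcpu)       */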

To evaluate the effectiveness of this change, we launch an 8-vcpu QEMU VM
on an Intel SPR CPU. In the VM, we run perf with all 48 events that Intel
VTune uses. In addition, we use the SPEC2017 benchmark programs as the
workload, configured for a single core, single thread.

At the host level, we probe invocations of vmx_cancel_injection() with the
following commands:

    $ perf probe -a vmx_cancel_injection
    $ perf stat -a -e probe:vmx_cancel_injection -I 10000 # per 10 seconds

The following are the results collected at the beginning of the SPEC2017
benchmark run (so mostly covering 500.perlbench_r). Kindly forgive the
incompleteness.

On kernel without the change:
    10.010018010              14254      probe:vmx_cancel_injection
    20.037646388              15207      probe:vmx_cancel_injection
    30.078739816              15261      probe:vmx_cancel_injection
    40.114033258              15085      probe:vmx_cancel_injection
    50.149297460              15112      probe:vmx_cancel_injection
    60.185103088              15104      probe:vmx_cancel_injection

On kernel with the change:
    10.003595390                 40      probe:vmx_cancel_injection
    20.017855682                 31      probe:vmx_cancel_injection
    30.028355883                 34      probe:vmx_cancel_injection
    40.038686298                 31      probe:vmx_cancel_injection
    50.048795162                 20      probe:vmx_cancel_injection
    60.069057747                 19      probe:vmx_cancel_injection

From the above, it is clear that the change saves roughly 1,500
invocations of vmx_cancel_injection() per vCPU per second for workloads
like perlbench.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/x86.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


base-commit: 73554b29bd70546c1a9efc9c160641ef1b849358

Comments

Mingwei Zhang Sept. 27, 2023, 4:15 a.m. UTC | #1
ah, typo in the subject: The 2nd KVM_REQ_NMI should be KVM_REQ_PMI.
Sorry about that.

On Tue, Sep 26, 2023 at 9:09 PM Mingwei Zhang <mizhang@google.com> wrote:
>
> Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ_NMI).
> When vPMU is active use, processing each KVM_REQ_PMI will generate a
> KVM_REQ_NMI. Existing control flow after KVM_REQ_PMI finished will fail the
> guest enter, jump to kvm_x86_cancel_injection(), and re-enter
> vcpu_enter_guest(), this wasted lot of cycles and increase the overhead for
> vPMU as well as the virtualization.
>
> So move the code snippet of kvm_check_request(KVM_REQ_NMI) to make KVM
> runloop more efficient with vPMU.
>
> To evaluate the effectiveness of this change, we launch a 8-vcpu QEMU VM on
> an Intel SPR CPU. In the VM, we run perf with all 48 events Intel vtune
> uses. In addition, we use SPEC2017 benchmark programs as the workload with
> the setup of using single core, single thread.
>
> At the host level, we probe the invocations to vmx_cancel_injection() with
> the following command:
>
>     $ perf probe -a vmx_cancel_injection
>     $ perf stat -a -e probe:vmx_cancel_injection -I 10000 # per 10 seconds
>
> The following is the result that we collected at beginning of the spec2017
> benchmark run (so mostly for 500.perlbench_r in spec2017). Kindly forgive
> the incompleteness.
>
> On kernel without the change:
>     10.010018010              14254      probe:vmx_cancel_injection
>     20.037646388              15207      probe:vmx_cancel_injection
>     30.078739816              15261      probe:vmx_cancel_injection
>     40.114033258              15085      probe:vmx_cancel_injection
>     50.149297460              15112      probe:vmx_cancel_injection
>     60.185103088              15104      probe:vmx_cancel_injection
>
> On kernel with the change:
>     10.003595390                 40      probe:vmx_cancel_injection
>     20.017855682                 31      probe:vmx_cancel_injection
>     30.028355883                 34      probe:vmx_cancel_injection
>     40.038686298                 31      probe:vmx_cancel_injection
>     50.048795162                 20      probe:vmx_cancel_injection
>     60.069057747                 19      probe:vmx_cancel_injection
>
> From the above, it is clear that we save 1500 invocations per vcpu per
> second to vmx_cancel_injection() for workloads like perlbench.
>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/x86.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 42a4e8f5e89a..302b6f8ddfb1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10580,12 +10580,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>                 if (kvm_check_request(KVM_REQ_SMI, vcpu))
>                         process_smi(vcpu);
>  #endif
> -               if (kvm_check_request(KVM_REQ_NMI, vcpu))
> -                       process_nmi(vcpu);
>                 if (kvm_check_request(KVM_REQ_PMU, vcpu))
>                         kvm_pmu_handle_event(vcpu);
>                 if (kvm_check_request(KVM_REQ_PMI, vcpu))
>                         kvm_pmu_deliver_pmi(vcpu);
> +               if (kvm_check_request(KVM_REQ_NMI, vcpu))
> +                       process_nmi(vcpu);
>                 if (kvm_check_request(KVM_REQ_IOAPIC_EOI_EXIT, vcpu)) {
>                         BUG_ON(vcpu->arch.pending_ioapic_eoi > 255);
>                         if (test_bit(vcpu->arch.pending_ioapic_eoi,
>
> base-commit: 73554b29bd70546c1a9efc9c160641ef1b849358
> --
> 2.42.0.515.g380fc7ccd1-goog
>
Xin Li Sept. 27, 2023, 5:04 a.m. UTC | #2
On 9/26/2023 9:15 PM, Mingwei Zhang wrote:
> ah, typo in the subject: The 2nd KVM_REQ_NMI should be KVM_REQ_PMI.
> Sorry about that.
> 
> On Tue, Sep 26, 2023 at 9:09 PM Mingwei Zhang <mizhang@google.com> wrote:
>>
>> Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ_NMI).

Please remove it, no need to repeat the subject.

>> When vPMU is active use, processing each KVM_REQ_PMI will generate a
>> KVM_REQ_NMI. Existing control flow after KVM_REQ_PMI finished will fail the
>> guest enter, jump to kvm_x86_cancel_injection(), and re-enter
>> vcpu_enter_guest(), this wasted lot of cycles and increase the overhead for
>> vPMU as well as the virtualization.

Optimization comes after correctness, so please explain whether this is
correct first!

>>
>> So move the code snippet of kvm_check_request(KVM_REQ_NMI) to make KVM
>> runloop more efficient with vPMU.
>>
>> To evaluate the effectiveness of this change, we launch a 8-vcpu QEMU VM on
>> an Intel SPR CPU. In the VM, we run perf with all 48 events Intel vtune
>> uses. In addition, we use SPEC2017 benchmark programs as the workload with
>> the setup of using single core, single thread.
>>
>> At the host level, we probe the invocations to vmx_cancel_injection() with
>> the following command:
>>
>>      $ perf probe -a vmx_cancel_injection
>>      $ perf stat -a -e probe:vmx_cancel_injection -I 10000 # per 10 seconds
>>
>> The following is the result that we collected at beginning of the spec2017
>> benchmark run (so mostly for 500.perlbench_r in spec2017). Kindly forgive
>> the incompleteness.
>>
>> On kernel without the change:
>>      10.010018010              14254      probe:vmx_cancel_injection
>>      20.037646388              15207      probe:vmx_cancel_injection
>>      30.078739816              15261      probe:vmx_cancel_injection
>>      40.114033258              15085      probe:vmx_cancel_injection
>>      50.149297460              15112      probe:vmx_cancel_injection
>>      60.185103088              15104      probe:vmx_cancel_injection
>>
>> On kernel with the change:
>>      10.003595390                 40      probe:vmx_cancel_injection
>>      20.017855682                 31      probe:vmx_cancel_injection
>>      30.028355883                 34      probe:vmx_cancel_injection
>>      40.038686298                 31      probe:vmx_cancel_injection
>>      50.048795162                 20      probe:vmx_cancel_injection
>>      60.069057747                 19      probe:vmx_cancel_injection
>>
>>  From the above, it is clear that we save 1500 invocations per vcpu per
>> second to vmx_cancel_injection() for workloads like perlbench.
>>
>> Signed-off-by: Mingwei Zhang <mizhang@google.com>
>> ---
>>   arch/x86/kvm/x86.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 42a4e8f5e89a..302b6f8ddfb1 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -10580,12 +10580,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>>                  if (kvm_check_request(KVM_REQ_SMI, vcpu))
>>                          process_smi(vcpu);
>>   #endif
>> -               if (kvm_check_request(KVM_REQ_NMI, vcpu))
>> -                       process_nmi(vcpu);
>>                  if (kvm_check_request(KVM_REQ_PMU, vcpu))
>>                          kvm_pmu_handle_event(vcpu);
>>                  if (kvm_check_request(KVM_REQ_PMI, vcpu))
>>                          kvm_pmu_deliver_pmi(vcpu);
>> +               if (kvm_check_request(KVM_REQ_NMI, vcpu))
>> +                       process_nmi(vcpu);
>>                  if (kvm_check_request(KVM_REQ_IOAPIC_EOI_EXIT, vcpu)) {
>>                          BUG_ON(vcpu->arch.pending_ioapic_eoi > 255);
>>                          if (test_bit(vcpu->arch.pending_ioapic_eoi,
>>
>> base-commit: 73554b29bd70546c1a9efc9c160641ef1b849358
>> --
>> 2.42.0.515.g380fc7ccd1-goog
>>
>
Sean Christopherson Sept. 27, 2023, 4:10 p.m. UTC | #3
On Tue, Sep 26, 2023, Xin Li wrote:
> On 9/26/2023 9:15 PM, Mingwei Zhang wrote:
> > ah, typo in the subject: The 2nd KVM_REQ_NMI should be KVM_REQ_PMI.
> > Sorry about that.
> > 
> > On Tue, Sep 26, 2023 at 9:09 PM Mingwei Zhang <mizhang@google.com> wrote:
> > > 
> > > Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ_NMI).
> 
> Please remove it, no need to repeat the subject.

Heh, from Documentation/process/maintainer-kvm-x86.rst:

  Changelog
  ~~~~~~~~~
  Most importantly, write changelogs using imperative mood and avoid pronouns.

  See :ref:`describe_changes` for more information, with one amendment: lead with
  a short blurb on the actual changes, and then follow up with the context and
  background.  Note!  This order directly conflicts with the tip tree's preferred
  approach!  Please follow the tip tree's preferred style when sending patches
  that primarily target arch/x86 code that is _NOT_ KVM code.

That said, I do prefer that the changelog intro isn't just a copy+paste of the
shortlog, and the shortlog and changelog should use conversational language instead
of describing the literal code movement.

> > > When vPMU is active use, processing each KVM_REQ_PMI will generate a

This is not guaranteed.

> > > KVM_REQ_NMI. Existing control flow after KVM_REQ_PMI finished will fail the
> > > guest enter, jump to kvm_x86_cancel_injection(), and re-enter
> > > vcpu_enter_guest(), this wasted lot of cycles and increase the overhead for
> > > vPMU as well as the virtualization.

As above, use conversational language, the changelog isn't meant to be a play-by-play.

E.g.

  KVM: x86: Service NMI requests *after* PMI requests in VM-Enter path

  Move the handling of NMI requests after PMI requests in the VM-Enter path
  so that KVM doesn't need to cancel and redo VM-Enter in the likely
  scenario that the vCPU has configured its LVTPC entry to generate an NMI.

  Because APIC emulation "injects" NMIs via KVM_REQ_NMI, handling PMI
  requests after NMI requests means KVM won't detect the pending NMI request
  until the final check for outstanding requests.  Detecting requests at the
  final stage is costly as KVM has already loaded guest state, potentially
  queued events for injection, disabled IRQs, dropped SRCU, etc., most of
  which needs to be unwound.
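
(For reference, a rough sketch of the late request check and the cancel
path described above, paraphrased from vcpu_enter_guest() in x86.c of
roughly this vintage with details elided; treat it as illustrative only:)

	if (kvm_vcpu_exit_request(vcpu)) {
		vcpu->mode = OUTSIDE_GUEST_MODE;
		smp_wmb();
		local_irq_enable();
		preempt_enable();
		...
		r = 1;
		goto cancel_injection;
	}
	...
cancel_injection:
	if (req_immediate_exit)
		kvm_make_request(KVM_REQ_EVENT, vcpu);
	static_call(kvm_x86_cancel_injection)(vcpu);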

> Optimization is after correctness, so please explain if this is correct
> first!

Not first.  Leading with an in-depth description of KVM requests and NMI handling
is not going to help understand *why* this change is being made.  But I do agree
that this should provide an analysis of why it's ok to swap the order, specifically
why it's architecturally ok if KVM drops an NMI due to the swapped ordering, e.g.
if the PMI is coincident with two other NMIs (or one other NMI and NMIs are blocked).
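
(Context for the "dropped NMI" concern: pending NMIs are coalesced.  A
deliberately simplified sketch of the capping done by process_nmi(); the
real code in x86.c also handles vNMI and injection state:)

	unsigned int limit = 2;	/* may drop to 1, e.g. an NMI is already injected */

	/* fold queued NMIs into nmi_pending, coalescing anything over the cap */
	vcpu->arch.nmi_pending += atomic_xchg(&vcpu->arch.nmi_queued, 0);
	vcpu->arch.nmi_pending = min(vcpu->arch.nmi_pending, limit);
	if (vcpu->arch.nmi_pending)
		kvm_make_request(KVM_REQ_EVENT, vcpu);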

> > > So move the code snippet of kvm_check_request(KVM_REQ_NMI) to make KVM
> > > runloop more efficient with vPMU.
> > > 
> > > To evaluate the effectiveness of this change, we launch a 8-vcpu QEMU VM on

Avoid pronouns.  There's no need for all the "fluff", just state the setup, the
test, and the results.

Really getting into the nits, but the whole "8-vcpu QEMU VM" versus
"the setup of using single core, single thread" is confusing IMO.  If there were
potential performance downsides and/or tradeoffs, then getting the gory details
might be necessary, but that's not the case here, and if it were really necessary
to drill down that deep, then I would want to better quantify the impact, e.g. in
terms of latency.

  E.g. on Intel SPR running SPEC2017 benchmark and Intel vtune in the guest,
  handling PMI requests before NMI requests reduces the number of canceled
  runs by ~1500 per second, per vCPU (counted by probing calls to
  vmx_cancel_injection()).

> > > an Intel SPR CPU. In the VM, we run perf with all 48 events Intel vtune
> > > uses. In addition, we use SPEC2017 benchmark programs as the workload with
> > > the setup of using single core, single thread.
> > > 
> > > At the host level, we probe the invocations to vmx_cancel_injection() with
> > > the following command:
> > > 
> > >      $ perf probe -a vmx_cancel_injection
> > >      $ perf stat -a -e probe:vmx_cancel_injection -I 10000 # per 10 seconds
> > > 
> > > The following is the result that we collected at beginning of the spec2017
> > > benchmark run (so mostly for 500.perlbench_r in spec2017). Kindly forgive
> > > the incompleteness.
> > > 
> > > On kernel without the change:
> > >      10.010018010              14254      probe:vmx_cancel_injection
> > >      20.037646388              15207      probe:vmx_cancel_injection
> > >      30.078739816              15261      probe:vmx_cancel_injection
> > >      40.114033258              15085      probe:vmx_cancel_injection
> > >      50.149297460              15112      probe:vmx_cancel_injection
> > >      60.185103088              15104      probe:vmx_cancel_injection
> > > 
> > > On kernel with the change:
> > >      10.003595390                 40      probe:vmx_cancel_injection
> > >      20.017855682                 31      probe:vmx_cancel_injection
> > >      30.028355883                 34      probe:vmx_cancel_injection
> > >      40.038686298                 31      probe:vmx_cancel_injection
> > >      50.048795162                 20      probe:vmx_cancel_injection
> > >      60.069057747                 19      probe:vmx_cancel_injection
> > > 
> > >  From the above, it is clear that we save 1500 invocations per vcpu per
> > > second to vmx_cancel_injection() for workloads like perlbench.

Nit, this really should have:

  Suggested-by: Sean Christopherson <seanjc@google.com>

I personally don't care about the attribution, but (a) others often do care and
(b) the added context is helpful.  E.g. for bad/questionable suggestions/ideas,
knowing that person X was also involved helps direct and/or curate questions/comments
accordingly.
Mingwei Zhang Sept. 27, 2023, 6:23 p.m. UTC | #4
On Wed, Sep 27, 2023 at 9:10 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Sep 26, 2023, Xin Li wrote:
> > On 9/26/2023 9:15 PM, Mingwei Zhang wrote:
> > > ah, typo in the subject: The 2nd KVM_REQ_NMI should be KVM_REQ_PMI.
> > > Sorry about that.
> > >
> > > On Tue, Sep 26, 2023 at 9:09 PM Mingwei Zhang <mizhang@google.com> wrote:
> > > >
> > > > Move kvm_check_request(KVM_REQ_NMI) after kvm_check_request(KVM_REQ_NMI).
> >
> > Please remove it, no need to repeat the subject.
>
> Heh, from Documentation/process/maintainer-kvm-x86.rst:
>
>   Changelog
>   ~~~~~~~~~
>   Most importantly, write changelogs using imperative mood and avoid pronouns.
>
>   See :ref:`describe_changes` for more information, with one amendment: lead with
>   a short blurb on the actual changes, and then follow up with the context and
>   background.  Note!  This order directly conflicts with the tip tree's preferred
>   approach!  Please follow the tip tree's preferred style when sending patches
>   that primarily target arch/x86 code that is _NOT_ KVM code.
>
> That said, I do prefer that the changelog intro isn't just a copy+paste of the
> shortlog, and the shortlog and changelog should use conversational language instead
> of describing the literal code movement.
>
> > > > When vPMU is active use, processing each KVM_REQ_PMI will generate a
>
> This is not guaranteed.
>
> > > > KVM_REQ_NMI. Existing control flow after KVM_REQ_PMI finished will fail the
> > > > guest enter, jump to kvm_x86_cancel_injection(), and re-enter
> > > > vcpu_enter_guest(), this wasted lot of cycles and increase the overhead for
> > > > vPMU as well as the virtualization.
>
> As above, use conversational language, the changelog isn't meant to be a play-by-play.
>
> E.g.
>
>   KVM: x86: Service NMI requests *after* PMI requests in VM-Enter path
>
>   Move the handling of NMI requests after PMI requests in the VM-Enter path
>   so that KVM doesn't need to cancel and redo VM-Enter in the likely
>   scenario that the vCPU has configured its LVPTC entry to generate an NMI.
>
>   Because APIC emulation "injects" NMIs via KVM_REQ_NMI, handling PMI
>   requests after NMI requests means KVM won't detect the pending NMI request
>   until the final check for outstanding requests.  Detecting requests at the
>   final stage is costly as KVM has already loaded guest state, potentially
>   queued events for injection, disabled IRQs, dropped SRCU, etc., most of
>   which needs to be unwound.
>
> > Optimization is after correctness, so please explain if this is correct
> > first!
>
> Not first.  Leading with an in-depth description of KVM requests and NMI handling
> is not going to help understand *why* this change is being first.  But I do agree
> that this should provide an analysis of why it's ok to swap the order, specificially
> why it's architecturally ok if KVM drops an NMI due to the swapped ordering, e.g.
> if the PMI is coincident with two other NMIs (or one other NMI and NMIs are blocked).
>
> > > > So move the code snippet of kvm_check_request(KVM_REQ_NMI) to make KVM
> > > > runloop more efficient with vPMU.
> > > >
> > > > To evaluate the effectiveness of this change, we launch a 8-vcpu QEMU VM on
>
> Avoid pronouns.  There's no need for all the "fluff", just state the setup, the
> test, and the results.
>
> Really getting into the nits, but the whole "8-vcpu QEMU VM" versus
> "the setup of using single core, single thread" is confusing IMO.  If there were
> potential performance downsides and/or tradeoffs, then getting the gory details
> might be necessary, but that's not the case here, and if it were really necessary
> to drill down that deep, then I would want to better quantify the impact, e.g. in
> terms latency.
>
>   E.g. on Intel SPR running SPEC2017 benchmark and Intel vtune in the guest,
>   handling PMI requests before NMI requests reduces the number of canceled
>   runs by ~1500 per second, per vCPU (counted by probing calls to
>   vmx_cancel_injection()).
>
> > > > an Intel SPR CPU. In the VM, we run perf with all 48 events Intel vtune
> > > > uses. In addition, we use SPEC2017 benchmark programs as the workload with
> > > > the setup of using single core, single thread.
> > > >
> > > > At the host level, we probe the invocations to vmx_cancel_injection() with
> > > > the following command:
> > > >
> > > >      $ perf probe -a vmx_cancel_injection
> > > >      $ perf stat -a -e probe:vmx_cancel_injection -I 10000 # per 10 seconds
> > > >
> > > > The following is the result that we collected at beginning of the spec2017
> > > > benchmark run (so mostly for 500.perlbench_r in spec2017). Kindly forgive
> > > > the incompleteness.
> > > >
> > > > On kernel without the change:
> > > >      10.010018010              14254      probe:vmx_cancel_injection
> > > >      20.037646388              15207      probe:vmx_cancel_injection
> > > >      30.078739816              15261      probe:vmx_cancel_injection
> > > >      40.114033258              15085      probe:vmx_cancel_injection
> > > >      50.149297460              15112      probe:vmx_cancel_injection
> > > >      60.185103088              15104      probe:vmx_cancel_injection
> > > >
> > > > On kernel with the change:
> > > >      10.003595390                 40      probe:vmx_cancel_injection
> > > >      20.017855682                 31      probe:vmx_cancel_injection
> > > >      30.028355883                 34      probe:vmx_cancel_injection
> > > >      40.038686298                 31      probe:vmx_cancel_injection
> > > >      50.048795162                 20      probe:vmx_cancel_injection
> > > >      60.069057747                 19      probe:vmx_cancel_injection
> > > >
> > > >  From the above, it is clear that we save 1500 invocations per vcpu per
> > > > second to vmx_cancel_injection() for workloads like perlbench.
>
> Nit, this really should have:
>
>   Suggested-by: Sean Christopherson <seanjc@google.com>
>
> I personally don't care about the attribution, but (a) others often do care and
> (b) the added context is helpful.  E.g. for bad/questionable suggestsions/ideas,
> knowing that person X was also involved helps direct and/or curate questions/comments
> accordingly.

For sure! I will also pay more attention to that in the future.

Thanks.
-Mingwei

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 42a4e8f5e89a..302b6f8ddfb1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10580,12 +10580,12 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		if (kvm_check_request(KVM_REQ_SMI, vcpu))
 			process_smi(vcpu);
 #endif
-		if (kvm_check_request(KVM_REQ_NMI, vcpu))
-			process_nmi(vcpu);
 		if (kvm_check_request(KVM_REQ_PMU, vcpu))
 			kvm_pmu_handle_event(vcpu);
 		if (kvm_check_request(KVM_REQ_PMI, vcpu))
 			kvm_pmu_deliver_pmi(vcpu);
+		if (kvm_check_request(KVM_REQ_NMI, vcpu))
+			process_nmi(vcpu);
 		if (kvm_check_request(KVM_REQ_IOAPIC_EOI_EXIT, vcpu)) {
 			BUG_ON(vcpu->arch.pending_ioapic_eoi > 255);
 			if (test_bit(vcpu->arch.pending_ioapic_eoi,