[12/14] KVM: retpolines: x86: eliminate retpoline from vmx.c exit handlers

Message ID 20190928172323.14663-13-aarcange@redhat.com (mailing list archive)
State New, archived
Series KVM monolithic v2

Commit Message

Andrea Arcangeli Sept. 28, 2019, 5:23 p.m. UTC
It's enough to check the exit value and issue a direct call to avoid
the retpoline for all the common vmexit reasons.

Reducing this list to only EXIT_REASON_MSR_WRITE,
EXIT_REASON_PREEMPTION_TIMER, EXIT_REASON_EPT_MISCONFIG and
EXIT_REASON_IO_INSTRUCTION increases the computation time of the
hrtimer guest testcase on a Haswell i5-4670T CPU @ 2.30GHz by 7% with
the default spectre v2 mitigation enabled in both host and guest. On
Skylake, by contrast, there's no measurable difference with the short
list. To put things in perspective, on Haswell the same hrtimer
workload (note: it never calls cpuid and it never attempts to trigger
more vmexits on purpose) running in the guest takes 16.3% longer to
compute with upstream KVM in the host than with the KVM mono v1
patchset applied to the host kernel, while on Skylake the same takes
only 5.4% more time (both with the default mitigations enabled in
guest and host).
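
For reference, a minimal guest load in the same spirit as the hrtimer
testcase (this is only an illustration, not the actual testcase used
for the numbers above) is a high frequency periodic timer whose
expirations are consumed in a loop:

/*
 * Illustration only: arm a 100us periodic timer and consume the
 * expirations.  In a KVM guest this tends to produce a steady stream
 * of timer related vmexits (APIC timer MSR writes, preemption timer,
 * interrupt window).
 */
#include <stdint.h>
#include <stdio.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct itimerspec its = {
		.it_interval = { .tv_nsec = 100 * 1000 },	/* 100us period */
		.it_value    = { .tv_nsec = 100 * 1000 },
	};
	uint64_t expirations, total = 0;
	int fd = timerfd_create(CLOCK_MONOTONIC, 0);

	if (fd < 0 || timerfd_settime(fd, 0, &its, NULL) < 0)
		return 1;
	while (total < 1000000) {	/* ~100 seconds of 10kHz ticks */
		if (read(fd, &expirations, sizeof(expirations)) !=
		    sizeof(expirations))
			break;
		total += expirations;
	}
	printf("consumed %llu timer expirations\n",
	       (unsigned long long)total);
	return 0;
}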

It's also unclear why EXIT_REASON_IO_INSTRUCTION should be included.

Of course CONFIG_RETPOLINE already prevents gcc from emitting indirect
jumps when compiling switch() statements, but a switch() would still
leave it to the compiler to binary-search the value, and that turns
out to run slower, if anything; the reason is that it's better to
prioritize and do the minimal possible number of checks for the most
common vmexit reasons.

The halt and pause loop exiting may be slow paths from the point of
view of the guest, but not necessarily so from the point of view of
the host. There can be a flood of halt exit reasons (in fact that's
why the guest cpuidle haltpoll support was recently merged, and we
can't rely on it here because there are older kernels and other OSes
that must also perform optimally). All it takes is a pipe ping pong
with a different host CPU, with the host CPUs running at full
capacity.

The same consideration applies to the pause loop exiting exit reason:
if there's heavy host overcommit and guests collide heavily on a
spinlock, the same may happen.

In the common case of a fully idle host, the halt and pause loop
exiting direct calls can't help, but adding them doesn't hurt the
common case, and the expectation here is that if they ever become
measurable, it will be because they are increasing (and not
decreasing) performance.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

Comments

Paolo Bonzini Oct. 15, 2019, 8:28 a.m. UTC | #1
On 28/09/19 19:23, Andrea Arcangeli wrote:
> Reducing this list to only EXIT_REASON_MSR_WRITE,
> EXIT_REASON_PREEMPTION_TIMER, EXIT_REASON_EPT_MISCONFIG,
> EXIT_REASON_IO_INSTRUCTION increases the computation time of the
> hrtimer guest testcase on Haswell i5-4670T CPU @ 2.30GHz by 7% with
> the default spectre v2 mitigation enabled in the host and guest. On
> skylake as opposed there's no measurable difference with the short
> list. To put things in prospective on Haswell the same hrtimer
> workload (note: it never calls cpuid and it never attempts to trigger
> more vmexit on purpose) in guest takes 16.3% longer to compute on
> upstream KVM running in the host than with the KVM mono v1 patchset
> applied to the host kernel, while on skylake the same takes only 5.4%
> more time (both with the default mitigations enabled in guest and
> host).
> 
> It's also unclear why EXIT_REASON_IO_INSTRUCTION should be included.

If you're including EXIT_REASON_EPT_MISCONFIG (MMIO access) then you
should include EXIT_REASON_IO_INSTRUCTION too.  Depending on the devices
that are in the guest, the doorbell register might be MMIO or PIO.

> +		if (exit_reason == EXIT_REASON_MSR_WRITE)
> +			return kvm_emulate_wrmsr(vcpu);
> +		else if (exit_reason == EXIT_REASON_PREEMPTION_TIMER)
> +			return handle_preemption_timer(vcpu);
> +		else if (exit_reason == EXIT_REASON_PENDING_INTERRUPT)
> +			return handle_interrupt_window(vcpu);
> +		else if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
> +			return handle_external_interrupt(vcpu);
> +		else if (exit_reason == EXIT_REASON_HLT)
> +			return kvm_emulate_halt(vcpu);
> +		else if (exit_reason == EXIT_REASON_PAUSE_INSTRUCTION)
> +			return handle_pause(vcpu);
> +		else if (exit_reason == EXIT_REASON_MSR_READ)
> +			return kvm_emulate_rdmsr(vcpu);
> +		else if (exit_reason == EXIT_REASON_CPUID)
> +			return kvm_emulate_cpuid(vcpu);
> +		else if (exit_reason == EXIT_REASON_EPT_MISCONFIG)
> +			return handle_ept_misconfig(vcpu);

So, the difference between my suggested list (which I admit is just
based on conjecture, not benchmarking) is that you add
EXIT_REASON_PAUSE_INSTRUCTION, EXIT_REASON_PENDING_INTERRUPT,
EXIT_REASON_EXTERNAL_INTERRUPT, EXIT_REASON_HLT, EXIT_REASON_MSR_READ,
EXIT_REASON_CPUID.

Which of these make a difference for the hrtimer testcase?  It's of
course totally fine to use benchmarks to prove that my intuition was
bad---but you must also use them to show why your intuition is right. :)

Paolo
Andrea Arcangeli Oct. 15, 2019, 4:49 p.m. UTC | #2
On Tue, Oct 15, 2019 at 10:28:39AM +0200, Paolo Bonzini wrote:
> If you're including EXIT_REASON_EPT_MISCONFIG (MMIO access) then you
> should include EXIT_REASON_IO_INSTRUCTION too.  Depending on the devices
> that are in the guest, the doorbell register might be MMIO or PIO.

The fact that outb/inb devices exist isn't the question here. The
question you should clarify is: which of the PIO devices is as
performance critical as MMIO with virtio/vhost? I mean, even on real
hardware those devices aren't performance critical, and I didn't run
into PIO drivers in properly configured guests.

> So, the difference between my suggested list (which I admit is just
> based on conjecture, not benchmarking) is that you add
> EXIT_REASON_PAUSE_INSTRUCTION, EXIT_REASON_PENDING_INTERRUPT,
> EXIT_REASON_EXTERNAL_INTERRUPT, EXIT_REASON_HLT, EXIT_REASON_MSR_READ,
> EXIT_REASON_CPUID.
> 
> Which of these make a difference for the hrtimer testcase?  It's of
> course totally fine to use benchmarks to prove that my intuition was
> bad---but you must also use them to show why your intuition is right. :)

The hrtimer flood hits on this:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time

           MSR_WRITE     338793    56.54%     5.51%      0.33us     34.44us      0.44us ( +-   0.20% )
   PENDING_INTERRUPT     168431    28.11%     2.52%      0.36us     32.06us      0.40us ( +-   0.28% )
    PREEMPTION_TIMER      91723    15.31%     1.32%      0.34us     30.51us      0.39us ( +-   0.41% )
  EXTERNAL_INTERRUPT        234     0.04%     0.00%      0.25us      5.53us      0.43us ( +-   5.67% )
                 HLT         65     0.01%    90.64%      0.49us 319933.79us  37562.71us ( +-  21.68% )
            MSR_READ          6     0.00%     0.00%      0.67us      1.96us      1.06us ( +-  17.97% )
       EPT_MISCONFIG          6     0.00%     0.01%      3.09us    105.50us     26.76us ( +-  62.10% )

PENDING_INTERRUPT is the big missing entry in your list; it probably
accounts for the bulk of the slowdown with your list. However I could
imagine other loads with more external interrupt/hlt/rdmsr exits than
the hrtimer one, so I didn't drop those. Other loads hit a flood of
HLT, and from the host's standpoint that's not a slow path. Not all
OSes have the cpuidle haltpoll governor to mitigate the HLT frequency.

I'm pretty sure HLT/EXTERNAL_INTERRUPT/PENDING_INTERRUPT should be
included.

The least useful are PAUSE, CPUID and MSR_READ; we could consider
dropping some of those (in the short term cpuid helps benchmarking, to
more accurately measure the performance improvement of not hitting the
retpoline there). I simply could imagine some load hitting frequently
on those too, so I didn't drop them.

I also wonder if VMCALL should be added: certain loads hit fairly
frequent VMCALLs, but none of the ones I benchmarked.
Paolo Bonzini Oct. 15, 2019, 7:46 p.m. UTC | #3
On 15/10/19 18:49, Andrea Arcangeli wrote:
> On Tue, Oct 15, 2019 at 10:28:39AM +0200, Paolo Bonzini wrote:
>> If you're including EXIT_REASON_EPT_MISCONFIG (MMIO access) then you
>> should include EXIT_REASON_IO_INSTRUCTION too.  Depending on the devices
>> that are in the guest, the doorbell register might be MMIO or PIO.
> 
> The fact outb/inb devices exists isn't the question here. The question
> you should clarify is: which of the PIO devices is performance
> critical as much as MMIO with virtio/vhost?

virtio 0.9 uses PIO.

> I mean even on real hardware those devices aren't performance critical.

On virtual machines they're actually faster than MMIO because they don't
need to go through page table walks.

>> So, the difference between my suggested list (which I admit is just
>> based on conjecture, not benchmarking) is that you add
>> EXIT_REASON_PAUSE_INSTRUCTION, EXIT_REASON_PENDING_INTERRUPT,
>> EXIT_REASON_EXTERNAL_INTERRUPT, EXIT_REASON_HLT, EXIT_REASON_MSR_READ,
>> EXIT_REASON_CPUID.
>>
>> Which of these make a difference for the hrtimer testcase?  It's of
>> course totally fine to use benchmarks to prove that my intuition was
>> bad---but you must also use them to show why your intuition is right. :)
> 
> The hrtimer flood hits on this:
> 
>            MSR_WRITE     338793    56.54%     5.51%      0.33us     34.44us      0.44us ( +-   0.20% )
>    PENDING_INTERRUPT     168431    28.11%     2.52%      0.36us     32.06us      0.40us ( +-   0.28% )
>     PREEMPTION_TIMER      91723    15.31%     1.32%      0.34us     30.51us      0.39us ( +-   0.41% )
>   EXTERNAL_INTERRUPT        234     0.04%     0.00%      0.25us      5.53us      0.43us ( +-   5.67% )
>                  HLT         65     0.01%    90.64%      0.49us 319933.79us  37562.71us ( +-  21.68% )
>             MSR_READ          6     0.00%     0.00%      0.67us      1.96us      1.06us ( +-  17.97% )
>        EPT_MISCONFIG          6     0.00%     0.01%      3.09us    105.50us     26.76us ( +-  62.10% )
> 
> PENDING_INTERRUPT is the big missing thing in your list. It probably
> accounts for the bulk of slowdown from your list.

Makes sense.

> However I could imagine other loads with higher external
> interrupt/hlt/rdmsr than the hrtimer one so I didn't drop those.

External interrupts should only tick at 1 Hz on nohz_full kernels, and
even at 1000 Hz (if physical CPUs are not isolated) they should not
really matter.  We can include it since the handler is so short that
the relative cost of the retpoline is much higher than for other exits.

HLT is certainly a slow path: the guest only invokes it when things
such as NAPI interrupt mitigation have failed.  As long as the guest
stays in the halted state for a microsecond or so, the cost of the
retpoline will all but disappear.

RDMSR again shouldn't be there; guests sometimes read the PMTimer
(which is an I/O port) or the TSC, but for example do not really ever
read the APIC TMCCT.

> I'm pretty sure HLT/EXTERNAL_INTERRUPT/PENDING_INTERRUPT should be
> included.
> I also wonder if VMCALL should be added, certain loads hit on fairly
> frequent VMCALL, but none of the one I benchmarked.

I agree for external interrupt and pending interrupt, and VMCALL is fine
too.  In addition I'd add I/O instructions which are useful for some
guests and also for benchmarking (e.g. vmexit.flat has both IN and OUT
tests).

Paolo
Andrea Arcangeli Oct. 15, 2019, 8:35 p.m. UTC | #4
On Tue, Oct 15, 2019 at 09:46:58PM +0200, Paolo Bonzini wrote:
> On 15/10/19 18:49, Andrea Arcangeli wrote:
> > On Tue, Oct 15, 2019 at 10:28:39AM +0200, Paolo Bonzini wrote:
> >> If you're including EXIT_REASON_EPT_MISCONFIG (MMIO access) then you
> >> should include EXIT_REASON_IO_INSTRUCTION too.  Depending on the devices
> >> that are in the guest, the doorbell register might be MMIO or PIO.
> > 
> > The fact outb/inb devices exists isn't the question here. The question
> > you should clarify is: which of the PIO devices is performance
> > critical as much as MMIO with virtio/vhost?
> 
> virtio 0.9 uses PIO.

0.9 is a 12-year-old protocol that was replaced several years
ago. Anybody who needs high performance won't be running it, and the
others can't perform well to begin with, so I'm not sure exactly how
it's relevant in this microoptimization context. We're not optimizing
for emulated devices or other old stuff either.

> On virtual machines they're actually faster than MMIO because they don't
> need to go through page table walks.

And how does it help that they're faster if current virtio stopped
using them and nothing else recent uses PIO?

> HLT is certainly a slow path, the guest only invokes if things such as

Your idea that HLT is certainly a slow path is only correct if you
assume the host is IDLE, but the host is never idle if you use virt
for consolidation.

From the point of view of the host, HLT is like every other vmexit.

> NAPI interrupt mitigation have failed.  As long as the guest stays in
> halted state for a microsecond or so, the cost of retpoline will all but
> disappear.

The only thing that matters is the number of HLT vmexits per second,
and you just need to measure the number of HLT vmexits to tell whether
the optimization is needed.

I have several workloads, including eBPF tracing, not related to
interrupts (which in turn cannot be mitigated by NAPI), that schedule
frequently and hit 100k+ HLT vmexits per second while the host is
anything but idle. There's no need for a hardware interrupt to wake up
tasks and schedule in the guest: scheduler IPIs and timers are more
than enough.

The only thing that can mitigate that is the cpuidle haltpoll driver,
but it hit upstream only a few months ago, so most recent enterprise
guest OSes won't have it yet.

All that matters is how many vmexits per second there are; everything
else, including "why" they happen and what those vmexits mean for the
guest, is irrelevant, or rather it would be relevant only if the host
were guaranteed to be idle, but there's no such guarantee.

If the host is using all idle CPUs to compute in the background
(e.g. building the kernel) with SCHED_IDLE, the HLT retpoline cost
will not be any different from any other vmexit retpoline cost, and an
easy 100k+ HLT exits per second certainly puts it in measurable
territory.

Here's a random example:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time 

                 HLT     101128    75.33%    99.66%      0.43us 901000.66us    310.88us ( +-   8.46% )
              VMCALL      14089    10.50%     0.10%      1.32us     84.99us      2.14us ( +-   0.90% )
           MSR_WRITE       8246     6.14%     0.03%      0.33us     32.79us      1.05us ( +-   1.51% )
       EPT_VIOLATION       6312     4.70%     0.18%      0.50us  26426.07us      8.90us ( +-  48.58% )
    PREEMPTION_TIMER       1730     1.29%     0.01%      0.55us     26.81us      1.60us ( +-   3.48% )
  EXTERNAL_INTERRUPT       1329     0.99%     0.03%      0.27us    944.88us      6.04us ( +-  20.52% )
       EPT_MISCONFIG        982     0.73%     0.01%      0.42us    137.68us      2.05us ( +-   9.88% )
   PENDING_INTERRUPT        308     0.23%     0.00%      0.44us      4.32us      0.73us ( +-   2.57% )
   PAUSE_INSTRUCTION         58     0.04%     0.00%      0.32us     18.55us      1.48us ( +-  23.12% )
            MSR_READ         35     0.03%     0.00%      0.78us      5.55us      2.07us ( +-  10.74% )
               CPUID         24     0.02%     0.00%      0.27us      2.20us      0.59us ( +-  13.43% )

# careful: despite the verifier's promise that eBPF shouldn't be able
# to crash the kernel, this may still be kernel crashing, because
# nothing verifies that the eBPF function calls available at a given
# hooking point can actually be invoked from the kernel hooking points
# they're invoked from. This is why I tested it in a VM.
bpftrace -e 'kprobe:*interrupt* { @ = count() }'

Another example, with a pipe loop that just bounces a byte across a
pipe between two processes:

             VM-EXIT    Samples  Samples%     Time%    Min Time    Max Time         Avg time 

           MSR_WRITE     498945    80.49%     4.10%      0.33us     42.73us      0.44us ( +-   0.12% )
                 HLT     118474    19.11%    95.88%      0.33us 707693.05us     43.56us ( +-  24.23% )
    PREEMPTION_TIMER       1004     0.16%     0.01%      0.38us     25.47us      0.67us ( +-   5.69% )
   PENDING_INTERRUPT        894     0.14%     0.01%      0.37us     20.98us      0.49us ( +-   4.94% )
  EXTERNAL_INTERRUPT        518     0.08%     0.00%      0.26us     20.59us      0.51us ( +-   8.09% )
            MSR_READ          8     0.00%     0.00%      0.66us      1.37us      0.92us ( +-   9.19% )
       EPT_MISCONFIG          6     0.00%     0.00%      3.18us     32.71us     12.60us ( +-  43.58% )
   PAUSE_INSTRUCTION          3     0.00%     0.00%      0.59us      1.69us      1.07us ( +-  30.38% )
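
That pipe loop is, in spirit, something like the sketch below (only an
illustration, not the exact program used for the numbers above):

/*
 * Illustration only: parent and child bounce one byte over two pipes.
 * Each side blocks in read() while the other runs, so an otherwise
 * idle guest vCPU ends up executing HLT at a very high rate.
 */
#include <unistd.h>

int main(void)
{
	int ptoc[2], ctop[2];	/* parent->child and child->parent */
	char byte = 0;
	long i, iterations = 1000000;

	if (pipe(ptoc) || pipe(ctop))
		return 1;

	if (fork() == 0) {			/* child: echo the byte back */
		for (i = 0; i < iterations; i++) {
			if (read(ptoc[0], &byte, 1) != 1 ||
			    write(ctop[1], &byte, 1) != 1)
				return 1;
		}
		return 0;
	}

	for (i = 0; i < iterations; i++) {	/* parent: send, then wait */
		if (write(ptoc[1], &byte, 1) != 1 ||
		    read(ctop[0], &byte, 1) != 1)
			return 1;
	}
	return 0;
}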

We wouldn't need to apply the cpuidle-haltpoll driver if HLT weren't
such a frequent vmexit that it deserves not to have its retpoline cost
multiplied by 100000 times per second.

Over time, if everything ends up using the cpuidle-haltpoll driver by
default (which however can increase the host CPU usage on laptops), we
can consider removing the HLT optimization, but we're not remotely
there yet.

> RDMSR again shouldn't be there, guests sometimes read the PMTimer (which
> is an I/O port) or TSC but for example do not really ever read the APIC
> TMCCT.

We can try to drop RDMSR and see if it's measurable. I already tried
to re-add some of those retpolines, but it was slower, and this was
the fastest combination that I got; I don't recall if I tried with
RDMSR and PAUSE alone, but I can try again.

> > I'm pretty sure HLT/EXTERNAL_INTERRUPT/PENDING_INTERRUPT should be
> > included.
> > I also wonder if VMCALL should be added, certain loads hit on fairly
> > frequent VMCALL, but none of the one I benchmarked.
> 
> I agree for external interrupt and pending interrupt, and VMCALL is fine
> too.  In addition I'd add I/O instructions which are useful for some
> guests and also for benchmarking (e.g. vmexit.flat has both IN and OUT
> tests).

Isn't it faster to use cpuid for benchmarking? I mean, we don't want
to pay for more than one branch just for benchmarking (even cpuid is
questionable in the long term, but for now it's handy to have), and
unlike inb/outb, cpuid runs occasionally in all real life workloads
(including in guest userland). So between cpuid and inb/outb, I'd
prefer to use cpuid as the benchmark vector, because at least it has a
chance to help real workloads a bit too.
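
(For instance, a trivial guest-side gadget along these lines, shown
purely as an illustration of the kind of measurement meant here and
not a program from this thread:)

/*
 * Illustration only: time a tight loop of CPUID instructions from
 * guest userland.  CPUID always vmexits, so the per-iteration cost
 * tracks the host's exit handling overhead, retpoline included.
 */
#include <cpuid.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	unsigned long sum = 0;
	struct timespec t0, t1;
	long i, iterations = 1000000;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iterations; i++) {
		__get_cpuid(0, &eax, &ebx, &ecx, &edx);
		sum += eax;	/* keep the result live */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.1f ns per CPUID (sum %lu)\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e9 +
		(t1.tv_nsec - t0.tv_nsec)) / iterations,
	       sum);
	return 0;
}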

Thanks,
Andrea
Paolo Bonzini Oct. 15, 2019, 10:22 p.m. UTC | #5
On 15/10/19 22:35, Andrea Arcangeli wrote:
> On Tue, Oct 15, 2019 at 09:46:58PM +0200, Paolo Bonzini wrote:
>> On 15/10/19 18:49, Andrea Arcangeli wrote:
>>> On Tue, Oct 15, 2019 at 10:28:39AM +0200, Paolo Bonzini wrote:
>>>> If you're including EXIT_REASON_EPT_MISCONFIG (MMIO access) then you
>>>> should include EXIT_REASON_IO_INSTRUCTION too.  Depending on the devices
>>>> that are in the guest, the doorbell register might be MMIO or PIO.
>>>
>>> The fact outb/inb devices exists isn't the question here. The question
>>> you should clarify is: which of the PIO devices is performance
>>> critical as much as MMIO with virtio/vhost?
>>
>> virtio 0.9 uses PIO.
> 
> 0.9 is a 12 years old protocol replaced several years ago.

Oh come on.  0.9 is not 12 years old.  virtio 1.0 is 3.5 years old
(March 2016).  Anything older than 2017 is going to use 0.9.

> Your idea that HLT is a certainly is a slow path is only correct if
> you assume the host is IDLE, but the host is never idle if you use
> virt for consolidation.
>
> I've several workloads including eBPF tracing, not related to
> interrupts (that in turn cannot be mitigated by NAPI) that schedule
> frequently and hit 100k+ of HLT vmexits per second and the host is all
> but idle. There's no need of hardware interrupt to wake up tasks and
> schedule in the guest, scheduler IPIs and timers are more than enough.
>
> All it matters is how many vmexits per second there are, everything
> else including "why" they happen and what those vmexists means for the
> guest, is irrelevant, or it would be relevant only if the host was
> guaranteed to be idle but there's no such guarantee.

Your tables give:

	Samples	  Samples%  Time%     Min Time  Max time       Avg time
HLT     101128    75.33%    99.66%    0.43us    901000.66us    310.88us
HLT     118474    19.11%    95.88%    0.33us    707693.05us    43.56us

If "avg time" means the average time to serve an HLT vmexit, I don't
understand how you can have an average time of 0.3ms (1/3000th of a
second) and 100000 samples per second.  Can you explain that to me?

Anyway, if the average time is indeed 310us and 43us, it is orders of
magnitude more than the time spent executing a retpoline.  That time
will be spent in an indirect branch miss (retpoline) instead of doing
while(!kvm_vcpu_check_block()), but it doesn't change anything.

>>> I'm pretty sure HLT/EXTERNAL_INTERRUPT/PENDING_INTERRUPT should be
>>> included.
>>> I also wonder if VMCALL should be added, certain loads hit on fairly
>>> frequent VMCALL, but none of the one I benchmarked.
>>
>> I agree for external interrupt and pending interrupt, and VMCALL is fine
>> too.  In addition I'd add I/O instructions which are useful for some
>> guests and also for benchmarking (e.g. vmexit.flat has both IN and OUT
>> tests).
> 
> Isn't it faster to use cpuid for benchmarking? I mean we don't want to
> pay for more than one branch for benchmarking (even cpuid is
> questionable in the long term, but for now it's handy to have),

outl is more or less the same as cpuid and vmcall.  You can measure it
with vmexit.flat.  inl is slower.

> and unlike inb/outb, cpuid runs occasionally in all real life workloads
> (including in guest userland) so between inb/outb, I'd rather prefer
> to use cpuid as the benchmark vector because at least it has a chance
> to help real workloads a bit too.

Again: what is the real workload that does thousands of CPUIDs per second?

Paolo
Andrea Arcangeli Oct. 15, 2019, 11:42 p.m. UTC | #6
On Wed, Oct 16, 2019 at 12:22:31AM +0200, Paolo Bonzini wrote:
> Oh come on.  0.9 is not 12-years old.  virtio 1.0 is 3.5 years old
> (March 2016).  Anything older than 2017 is going to use 0.9.

Sorry if I got the date wrong, but I still don't see the point in
optimizing for legacy virtio. I can't justify forcing everyone to
execute that additional branch for inb/outb in an attempt to make
legacy virtio faster, when nobody should be using it in combination
with bleeding edge KVM in the host.

> Your tables give:
> 
> 	Samples	  Samples%  Time%     Min Time  Max time       Avg time
> HLT     101128    75.33%    99.66%    0.43us    901000.66us    310.88us
> HLT     118474    19.11%    95.88%    0.33us    707693.05us    43.56us
> 
> If "avg time" means the average time to serve an HLT vmexit, I don't
> understand how you can have an average time of 0.3ms (1/3000th of a
> second) and 100000 samples per second.  Can you explain that to me?

I described it wrong: the bpftrace record was a sleep 5, not a sleep
1. The pipe loop was indeed a sleep 1.

I just wanted to show how even with loads where you wouldn't expect
to get HLT at all, like the bpftrace one that is pure guest CPU load,
you still get 100k of them (over 5 sec).

The issue is that in production you get a flood more of those with
hundreds of CPUs, so the exact number doesn't move the needle.

> Anyway, if the average time is indeed 310us and 43us, it is orders of
> magnitude more than the time spent executing a retpoline.  That time
> will be spent in an indirect branch miss (retpoline) instead of doing
> while(!kvm_vcpu_check_block()), but it doesn't change anything.

Doesn't cpuidle haltpoll disable that loop? Ideally there would still
be HLT vmexits then, but I don't know how many fewer. This just needs
to be frequent enough that the branch cost pays for itself, but the
sure thing is that HLT vmexits will not go away unless you execute
mwait in guest mode by isolating the CPU in the host.

> Again: what is the real workload that does thousands of CPUIDs per second?

None, but there are always background CPUID vmexits while there are
never inb/outb vmexits.

So the cpuid retpoline removal has a slight chance to pay for the
cost of the branch, while the inb/outb retpoline removal cannot pay
off the cost of the branch.

This is why I prefer cpuid as a benchmark gadget for the short term,
unless inb/outb offers other benchmark-related benefits.

Thanks,
Andrea
Paolo Bonzini Oct. 16, 2019, 7:07 a.m. UTC | #7
On 16/10/19 01:42, Andrea Arcangeli wrote:
> On Wed, Oct 16, 2019 at 12:22:31AM +0200, Paolo Bonzini wrote:
>> Oh come on.  0.9 is not 12-years old.  virtio 1.0 is 3.5 years old
>> (March 2016).  Anything older than 2017 is going to use 0.9.
> 
> Sorry if I got the date wrong, but still I don't see the point in
> optimizing for legacy virtio. I can't justify forcing everyone to
> execute that additional branch for inb/outb, in the attempt to make
> legacy virtio faster that nobody should use in combination with
> bleeding edge KVM in the host.

Yet you would add CPUID to the list even though it is not even there in
your benchmarks, and is *never* invoked in a hot path by *any* sane
program? Some OSes have never gotten virtio 1.0 drivers.  OpenBSD only
got it earlier this year.

>> Your tables give:
>>
>> 	Samples	  Samples%  Time%     Min Time  Max time       Avg time
>> HLT     101128    75.33%    99.66%    0.43us    901000.66us    310.88us
>> HLT     118474    19.11%    95.88%    0.33us    707693.05us    43.56us
>>
>> If "avg time" means the average time to serve an HLT vmexit, I don't
>> understand how you can have an average time of 0.3ms (1/3000th of a
>> second) and 100000 samples per second.  Can you explain that to me?
> 
> I described it wrong, the bpftrace record was a sleep 5, not a sleep
> 1. The pipe loop was sure a sleep 1.

It still doesn't add up.  0.3ms / 5 is 1/15000th of a second; 43us is
1/25000th of a second.  Do you have multiple vCPU perhaps?

> The issue is that in production you get a flood more of those with
> hundred of CPUs, so the exact number doesn't move the needle.
> This just needs to be frequent enough that the branch cost pay itself off,
> but the sure thing is that HLT vmexit will not go away unless you execute
> mwait in guest mode by isolating the CPU in the host.

The number of vmexits doesn't count (for HLT).  What counts is how long
they take to be serviced, and as long as it's 1us or more the
optimization is pointless.

Consider these pictures

         w/o optimization                   with optimization
         ----------------------             -------------------------
0us      vmexit                             vmexit
500ns    retpoline                          call vmexit handler directly
600ns    retpoline                          kvm_vcpu_check_block()
700ns    retpoline                          kvm_vcpu_check_block()
800ns    kvm_vcpu_check_block()             kvm_vcpu_check_block()
900ns    kvm_vcpu_check_block()             kvm_vcpu_check_block()
...
39900ns  kvm_vcpu_check_block()             kvm_vcpu_check_block()

                            <interrupt arrives>

40000ns  kvm_vcpu_check_block()             kvm_vcpu_check_block()


Unless the interrupt arrives exactly in the few nanoseconds that it
takes to execute the retpoline, a direct handling of HLT vmexits makes
*absolutely no difference*.

>> Again: what is the real workload that does thousands of CPUIDs per second?
> 
> None, but there are always background CPUID vmexits while there are
> never inb/outb vmexits.
> 
> So the cpuid retpoline removal has a slight chance to pay for the cost
> of the branch, the inb/outb retpoline removal cannot pay off the cost
> of the branch.

Please stop considering only the exact configuration of your benchmarks.
 There are known, valid configurations where outb is a very hot vmexit.

Thanks,

Paolo
Andrea Arcangeli Oct. 16, 2019, 4:50 p.m. UTC | #8
On Wed, Oct 16, 2019 at 09:07:39AM +0200, Paolo Bonzini wrote:
> Yet you would add CPUID to the list even though it is not even there in
> your benchmarks, and is *never* invoked in a hot path by *any* sane

I justified CPUID as a "short term" benchmark gadget; it's one of
those that shouldn't be a problem at all to remove, and I couldn't
possibly be against removing it. I only pointed out the fact that
cpuid on any modern Linux guest is going to run more frequently than
any inb/outb, so if I had to pick a benchmark gadget, that remains my
favorite one.

> program? Some OSes have never gotten virtio 1.0 drivers.  OpenBSD only
> got it earlier this year.

If the target is an optimization for a cranky OS that can't upgrade
virtio, so that it too obtains the full performance benefit of the
retpoline removal (I don't know the specifics just from reading the
above), then it's a better argument. At least it sounds fair enough
not to unfairly penalize the cranky OS forced to run obsolete
protocols that nobody can update or has the time to update.

I mean, until you said there's some OS that cannot upgrade to virtio
1.0, I thought it was perfectly fine to say "if you want to run a
guest with the full benefit of virtio 1.0 on KVM, you should upgrade
to virtio 1.0 and not stick to whatever 3-year-old protocol; then the
inb/outb retpoline will also go away if you upgrade the host, because
the inb/outb will go away in the first place".

> It still doesn't add up.  0.3ms / 5 is 1/15000th of a second; 43us is
> 1/25000th of a second.  Do you have multiple vCPU perhaps?

Why would I run any test on UP guests? Rather than spending time
doing the math on my results, it's probably quicker if you run it
yourself:

https://lkml.kernel.org/r/20190109034941.28759-1-aarcange@redhat.com/

Marcelo should have better reproducers for frequent HLT, which is a
real workload we have to pass; I reported the first two random things
I had around that showed fairly frequent HLT. The pipe loop load is
similar to local network I/O.

> The number of vmexits doesn't count (for HLT).  What counts is how long
> they take to be serviced, and as long as it's 1us or more the
> optimization is pointless.
> 
> Consider these pictures
> 
>          w/o optimization                   with optimization
>          ----------------------             -------------------------
> 0us      vmexit                             vmexit
> 500ns    retpoline                          call vmexit handler directly
> 600ns    retpoline                          kvm_vcpu_check_block()
> 700ns    retpoline                          kvm_vcpu_check_block()
> 800ns    kvm_vcpu_check_block()             kvm_vcpu_check_block()
> 900ns    kvm_vcpu_check_block()             kvm_vcpu_check_block()
> ...
> 39900ns  kvm_vcpu_check_block()             kvm_vcpu_check_block()
> 
>                             <interrupt arrives>
> 
> 40000ns  kvm_vcpu_check_block()             kvm_vcpu_check_block()
> 
> 
> Unless the interrupt arrives exactly in the few nanoseconds that it
> takes to execute the retpoline, a direct handling of HLT vmexits makes
> *absolutely no difference*.
> 

You keep focusing on what happens if the host is completely idle (in
which case guest HLT is a slow path) and you keep ignoring the case
that the host isn't completely idle (in which case guest HLT is not a
slow path).

Please note the single_task_running() check which immediately breaks
the kvm_vcpu_check_block() loop if there's even a single other task
that can be scheduled in the runqueue of the host CPU.
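
Roughly, the host side of halt polling looks like the sketch below (a
simplified paraphrase of virt/kvm/kvm_main.c, not the literal code):

/*
 * Simplified paraphrase of what the host does on a HLT exit.  The
 * point is the single_task_running() test: as soon as another
 * runnable task shows up on this host CPU the polling stops and the
 * vCPU schedules out, so on a busy host a HLT exit is serviced about
 * as quickly as any other exit.
 */
static void vcpu_block_sketch(struct kvm_vcpu *vcpu)
{
	if (vcpu->halt_poll_ns) {
		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);

		do {
			if (kvm_vcpu_check_block(vcpu) < 0)
				return;	/* vCPU is runnable again */
		} while (single_task_running() &&
			 ktime_before(ktime_get(), stop));
	}

	/* otherwise give up the CPU until the vCPU becomes runnable */
	schedule();
}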

What happens when the host is not idle is shown below:

         w/o optimization                   with optimization
         ----------------------             -------------------------
0us      vmexit                             vmexit
500ns    retpoline                          call vmexit handler directly
600ns    retpoline                          kvm_vcpu_check_block()
700ns    retpoline                          schedule()
800ns    kvm_vcpu_check_block()
900ns    schedule()
...

Disclaimer: the numbers on the left are arbitrary and I just cut and
pasted them from yours, no idea how far off they are.

To be clear, I would find it very reasonable to be asked to prove the
benefit of the HLT optimization with benchmarks specific to that
single one liner, but until then, the idea that we can drop the
retpoline optimization from the HLT vmexit just by thinking about it
still doesn't make sense to me, because by thinking about it I come to
the opposite conclusion.

The lack of single_task_running() in the guest driver is also why the
guest cpuidle haltpoll risks wasting some CPU with host overcommit or
with the host loaded at full capacity, and why we may not assume it to
be universally enabled.

Thanks,
Andrea
Paolo Bonzini Oct. 16, 2019, 5:01 p.m. UTC | #9
On 16/10/19 18:50, Andrea Arcangeli wrote:
>> It still doesn't add up.  0.3ms / 5 is 1/15000th of a second; 43us is
>> 1/25000th of a second.  Do you have multiple vCPU perhaps?
> 
> Why would I run any test on UP guests? Rather then spending time doing
> the math on my results, it's probably quicker that you run it yourself:

I don't know, but if you don't say how many vCPUs you have, I cannot do
the math and review the patch.

>> The number of vmexits doesn't count (for HLT).  What counts is how long
>> they take to be serviced, and as long as it's 1us or more the
>> optimization is pointless.
>
> Please note the single_task_running() check which immediately breaks
> the kvm_vcpu_check_block() loop if there's even a single other task
> that can be scheduled in the runqueue of the host CPU.
> 
> What happen when the host is not idle is quoted below:
> 
>          w/o optimization                   with optimization
>          ----------------------             -------------------------
> 0us      vmexit                             vmexit
> 500ns    retpoline                          call vmexit handler directly
> 600ns    retpoline                          kvm_vcpu_check_block()
> 700ns    retpoline                          schedule()
> 800ns    kvm_vcpu_check_block()
> 900ns    schedule()
> ...
> 
> Disclaimer: the numbers on the left are arbitrary and I just cut and
> pasted them from yours, no idea how far off they are.

Yes, of course.  But the idea is the same: yes, because of the retpoline
you run the guest for perhaps 300ns more before schedule()ing, but does
that really matter?  300ns * 20000 times/second is a 0.6% performance
impact, and 300ns is already very generous.  I am not sure it would be
measurable at all.

Paolo

> To be clear, I would find it very reasonable to be requested to proof
> the benefit of the HLT optimization with benchmarks specifics for that
> single one liner, but until then, the idea that we can drop the
> retpoline optimization from the HLT vmexit by just thinking about it,
> still doesn't make sense to me, because by thinking about it I come to
> the opposite conclusion.
> 
> The lack of single_task_running() in the guest driver is also why the
> guest cpuidle haltpoll risks to waste some CPU with host overcommit or
> with the host loaded at full capacity and why we may not assume it to
> be universally enabled.
> 
> Thanks,
> Andrea
>

Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index de3ae2246205..2bd57a7d2be1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5846,9 +5846,29 @@  int kvm_x86_handle_exit(struct kvm_vcpu *vcpu)
 	}
 
 	if (exit_reason < kvm_vmx_max_exit_handlers
-	    && kvm_vmx_exit_handlers[exit_reason])
+	    && kvm_vmx_exit_handlers[exit_reason]) {
+#ifdef CONFIG_RETPOLINE
+		if (exit_reason == EXIT_REASON_MSR_WRITE)
+			return kvm_emulate_wrmsr(vcpu);
+		else if (exit_reason == EXIT_REASON_PREEMPTION_TIMER)
+			return handle_preemption_timer(vcpu);
+		else if (exit_reason == EXIT_REASON_PENDING_INTERRUPT)
+			return handle_interrupt_window(vcpu);
+		else if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
+			return handle_external_interrupt(vcpu);
+		else if (exit_reason == EXIT_REASON_HLT)
+			return kvm_emulate_halt(vcpu);
+		else if (exit_reason == EXIT_REASON_PAUSE_INSTRUCTION)
+			return handle_pause(vcpu);
+		else if (exit_reason == EXIT_REASON_MSR_READ)
+			return kvm_emulate_rdmsr(vcpu);
+		else if (exit_reason == EXIT_REASON_CPUID)
+			return kvm_emulate_cpuid(vcpu);
+		else if (exit_reason == EXIT_REASON_EPT_MISCONFIG)
+			return handle_ept_misconfig(vcpu);
+#endif
 		return kvm_vmx_exit_handlers[exit_reason](vcpu);
-	else {
+	} else {
 		vcpu_unimpl(vcpu, "vmx: unexpected exit reason 0x%x\n",
 				exit_reason);
 		dump_vmcs();