Message ID | 20190928172323.14663-13-aarcange@redhat.com (mailing list archive)
---|---
State | New, archived
Series | KVM monolithic v2
On 28/09/19 19:23, Andrea Arcangeli wrote:
> Reducing this list to only EXIT_REASON_MSR_WRITE,
> EXIT_REASON_PREEMPTION_TIMER, EXIT_REASON_EPT_MISCONFIG,
> EXIT_REASON_IO_INSTRUCTION increases the computation time of the
> hrtimer guest testcase on Haswell i5-4670T CPU @ 2.30GHz by 7% with
> the default spectre v2 mitigation enabled in the host and guest. On
> skylake as opposed there's no measurable difference with the short
> list. To put things in perspective, on Haswell the same hrtimer
> workload (note: it never calls cpuid and it never attempts to trigger
> more vmexits on purpose) in guest takes 16.3% longer to compute on
> upstream KVM running in the host than with the KVM mono v1 patchset
> applied to the host kernel, while on skylake the same takes only 5.4%
> more time (both with the default mitigations enabled in guest and
> host).
>
> It's also unclear why EXIT_REASON_IO_INSTRUCTION should be included.

If you're including EXIT_REASON_EPT_MISCONFIG (MMIO access) then you
should include EXIT_REASON_IO_INSTRUCTION too.  Depending on the devices
that are in the guest, the doorbell register might be MMIO or PIO.
> +		if (exit_reason == EXIT_REASON_MSR_WRITE)
> +			return kvm_emulate_wrmsr(vcpu);
> +		else if (exit_reason == EXIT_REASON_PREEMPTION_TIMER)
> +			return handle_preemption_timer(vcpu);
> +		else if (exit_reason == EXIT_REASON_PENDING_INTERRUPT)
> +			return handle_interrupt_window(vcpu);
> +		else if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
> +			return handle_external_interrupt(vcpu);
> +		else if (exit_reason == EXIT_REASON_HLT)
> +			return kvm_emulate_halt(vcpu);
> +		else if (exit_reason == EXIT_REASON_PAUSE_INSTRUCTION)
> +			return handle_pause(vcpu);
> +		else if (exit_reason == EXIT_REASON_MSR_READ)
> +			return kvm_emulate_rdmsr(vcpu);
> +		else if (exit_reason == EXIT_REASON_CPUID)
> +			return kvm_emulate_cpuid(vcpu);
> +		else if (exit_reason == EXIT_REASON_EPT_MISCONFIG)
> +			return handle_ept_misconfig(vcpu);

So, the difference between my suggested list (which I admit is just
based on conjecture, not benchmarking) is that you add
EXIT_REASON_PAUSE_INSTRUCTION, EXIT_REASON_PENDING_INTERRUPT,
EXIT_REASON_EXTERNAL_INTERRUPT, EXIT_REASON_HLT, EXIT_REASON_MSR_READ,
EXIT_REASON_CPUID.

Which of these make a difference for the hrtimer testcase?  It's of
course totally fine to use benchmarks to prove that my intuition was
bad---but you must also use them to show why your intuition is right. :)

Paolo
On Tue, Oct 15, 2019 at 10:28:39AM +0200, Paolo Bonzini wrote:
> If you're including EXIT_REASON_EPT_MISCONFIG (MMIO access) then you
> should include EXIT_REASON_IO_INSTRUCTION too.  Depending on the devices
> that are in the guest, the doorbell register might be MMIO or PIO.

The fact outb/inb devices exist isn't the question here. The question
you should clarify is: which of the PIO devices is as performance
critical as MMIO with virtio/vhost? I mean, even on real hardware those
devices aren't performance critical. I didn't run into PIO drivers with
properly configured guests.

> So, the difference between my suggested list (which I admit is just
> based on conjecture, not benchmarking) is that you add
> EXIT_REASON_PAUSE_INSTRUCTION, EXIT_REASON_PENDING_INTERRUPT,
> EXIT_REASON_EXTERNAL_INTERRUPT, EXIT_REASON_HLT, EXIT_REASON_MSR_READ,
> EXIT_REASON_CPUID.
>
> Which of these make a difference for the hrtimer testcase?  It's of
> course totally fine to use benchmarks to prove that my intuition was
> bad---but you must also use them to show why your intuition is right. :)

The hrtimer flood hits on this:

 MSR_WRITE            338793  56.54%   5.51%  0.33us      34.44us     0.44us ( +-  0.20% )
 PENDING_INTERRUPT    168431  28.11%   2.52%  0.36us      32.06us     0.40us ( +-  0.28% )
 PREEMPTION_TIMER      91723  15.31%   1.32%  0.34us      30.51us     0.39us ( +-  0.41% )
 EXTERNAL_INTERRUPT      234   0.04%   0.00%  0.25us       5.53us     0.43us ( +-  5.67% )
 HLT                      65   0.01%  90.64%  0.49us  319933.79us  37562.71us ( +- 21.68% )
 MSR_READ                  6   0.00%   0.00%  0.67us       1.96us     1.06us ( +- 17.97% )
 EPT_MISCONFIG             6   0.00%   0.01%  3.09us     105.50us    26.76us ( +- 62.10% )

PENDING_INTERRUPT is the big missing thing in your list. It probably
accounts for the bulk of the slowdown from your list. However I could
imagine other loads with higher external interrupt/hlt/rdmsr rates than
the hrtimer one, so I didn't drop those.

Other loads are hitting on a flood of HLT, and from the host standpoint
that's not a slow path. Not all OSes have the cpuidle haltpoll governor
to mitigate the HLT frequency.
I'm pretty sure HLT/EXTERNAL_INTERRUPT/PENDING_INTERRUPT should be
included. The least useful are PAUSE, CPUID and MSR_READ; we could
consider dropping some of those (in the short term cpuid helps
benchmarking to more accurately measure the performance improvement of
not hitting the retpoline there). I simply could imagine some load
hitting frequently on those too, so I didn't drop them. I also wonder
if VMCALL should be added: certain loads hit on fairly frequent VMCALL,
but none of the ones I benchmarked.
On 15/10/19 18:49, Andrea Arcangeli wrote:
> On Tue, Oct 15, 2019 at 10:28:39AM +0200, Paolo Bonzini wrote:
>> If you're including EXIT_REASON_EPT_MISCONFIG (MMIO access) then you
>> should include EXIT_REASON_IO_INSTRUCTION too.  Depending on the devices
>> that are in the guest, the doorbell register might be MMIO or PIO.
>
> The fact outb/inb devices exist isn't the question here. The question
> you should clarify is: which of the PIO devices is as performance
> critical as MMIO with virtio/vhost?

virtio 0.9 uses PIO.

> I mean, even on real hardware those devices aren't performance critical.

On virtual machines they're actually faster than MMIO because they don't
need to go through page table walks.

>> So, the difference between my suggested list (which I admit is just
>> based on conjecture, not benchmarking) is that you add
>> EXIT_REASON_PAUSE_INSTRUCTION, EXIT_REASON_PENDING_INTERRUPT,
>> EXIT_REASON_EXTERNAL_INTERRUPT, EXIT_REASON_HLT, EXIT_REASON_MSR_READ,
>> EXIT_REASON_CPUID.
>>
>> Which of these make a difference for the hrtimer testcase?  It's of
>> course totally fine to use benchmarks to prove that my intuition was
>> bad---but you must also use them to show why your intuition is right. :)
>
> The hrtimer flood hits on this:
>
>  MSR_WRITE            338793  56.54%   5.51%  0.33us      34.44us     0.44us ( +-  0.20% )
>  PENDING_INTERRUPT    168431  28.11%   2.52%  0.36us      32.06us     0.40us ( +-  0.28% )
>  PREEMPTION_TIMER      91723  15.31%   1.32%  0.34us      30.51us     0.39us ( +-  0.41% )
>  EXTERNAL_INTERRUPT      234   0.04%   0.00%  0.25us       5.53us     0.43us ( +-  5.67% )
>  HLT                      65   0.01%  90.64%  0.49us  319933.79us  37562.71us ( +- 21.68% )
>  MSR_READ                  6   0.00%   0.00%  0.67us       1.96us     1.06us ( +- 17.97% )
>  EPT_MISCONFIG             6   0.00%   0.01%  3.09us     105.50us    26.76us ( +- 62.10% )
>
> PENDING_INTERRUPT is the big missing thing in your list. It probably
> accounts for the bulk of the slowdown from your list.

Makes sense.
> However I could imagine other loads with higher external
> interrupt/hlt/rdmsr rates than the hrtimer one, so I didn't drop those.

External interrupts should only tick at 1 Hz on nohz_full kernels, and
even at 1000 Hz (if physical CPUs are not isolated) it should not really
matter. We can include it since it has such a short handler that the
cost of the retpoline is, percentage-wise, much higher than for other
exits.

HLT is certainly a slow path: the guest only invokes it when things such
as NAPI interrupt mitigation have failed. As long as the guest stays in
halted state for a microsecond or so, the cost of the retpoline will all
but disappear.

RDMSR again shouldn't be there; guests sometimes read the PMTimer (which
is an I/O port) or TSC but for example do not really ever read the APIC
TMCCT.

> I'm pretty sure HLT/EXTERNAL_INTERRUPT/PENDING_INTERRUPT should be
> included.
> I also wonder if VMCALL should be added: certain loads hit on fairly
> frequent VMCALL, but none of the ones I benchmarked.

I agree for external interrupt and pending interrupt, and VMCALL is fine
too. In addition I'd add I/O instructions, which are useful for some
guests and also for benchmarking (e.g. vmexit.flat has both IN and OUT
tests).

Paolo
On Tue, Oct 15, 2019 at 09:46:58PM +0200, Paolo Bonzini wrote:
> On 15/10/19 18:49, Andrea Arcangeli wrote:
>> On Tue, Oct 15, 2019 at 10:28:39AM +0200, Paolo Bonzini wrote:
>>> If you're including EXIT_REASON_EPT_MISCONFIG (MMIO access) then you
>>> should include EXIT_REASON_IO_INSTRUCTION too.  Depending on the devices
>>> that are in the guest, the doorbell register might be MMIO or PIO.
>>
>> The fact outb/inb devices exist isn't the question here. The question
>> you should clarify is: which of the PIO devices is as performance
>> critical as MMIO with virtio/vhost?
>
> virtio 0.9 uses PIO.

0.9 is a 12 years old protocol replaced several years ago. Anybody who
needs high performance won't be running it, and the others can't perform
well to begin with, so I'm not sure exactly how it's relevant in this
microoptimization context. We're not optimizing for emulated devices or
other old stuff either.

> On virtual machines they're actually faster than MMIO because they don't
> need to go through page table walks.

And how does it help that they're faster if current virtio stopped using
them and nothing else recent uses PIO?

> HLT is certainly a slow path: the guest only invokes it when things such as

Your idea that HLT is certainly a slow path is only correct if you
assume the host is idle, but the host is never idle if you use virt for
consolidation. From the point of view of the host, HLT is like every
other vmexit.

> NAPI interrupt mitigation have failed. As long as the guest stays in
> halted state for a microsecond or so, the cost of the retpoline will all
> but disappear.

The only thing that matters is the number of HLT vmexits per second, and
you just need to measure the number of HLT vmexits to tell it's needed.
I've several workloads, including eBPF tracing, not related to
interrupts (which in turn cannot be mitigated by NAPI) that schedule
frequently and hit 100k+ HLT vmexits per second, and the host is all but
idle.
There's no need of a hardware interrupt to wake up tasks and schedule in
the guest; scheduler IPIs and timers are more than enough. The only
thing that can mitigate that is the cpuidle haltpoll driver, but it hit
upstream only a few months ago, so almost all recent enterprise guest
OSes won't have it yet.

All that matters is how many vmexits per second there are; everything
else, including "why" they happen and what those vmexits mean for the
guest, is irrelevant, or it would be relevant only if the host was
guaranteed to be idle, but there's no such guarantee. If the host is
using all idle CPUs to compute in the background (e.g. building the
kernel) with SCHED_IDLE, the HLT retpoline cost will not be any
different from any other vmexit retpoline cost, and an easy 100k+ HLT
exits per second certainly puts it in measurable territory.

Here's a random example:

 VM-EXIT             Samples  Samples%   Time%  Min Time     Max Time     Avg time
 HLT                  101128    75.33%  99.66%   0.43us  901000.66us  310.88us ( +-  8.46% )
 VMCALL                14089    10.50%   0.10%   1.32us      84.99us    2.14us ( +-  0.90% )
 MSR_WRITE              8246     6.14%   0.03%   0.33us      32.79us    1.05us ( +-  1.51% )
 EPT_VIOLATION          6312     4.70%   0.18%   0.50us   26426.07us    8.90us ( +- 48.58% )
 PREEMPTION_TIMER       1730     1.29%   0.01%   0.55us      26.81us    1.60us ( +-  3.48% )
 EXTERNAL_INTERRUPT     1329     0.99%   0.03%   0.27us     944.88us    6.04us ( +- 20.52% )
 EPT_MISCONFIG           982     0.73%   0.01%   0.42us     137.68us    2.05us ( +-  9.88% )
 PENDING_INTERRUPT       308     0.23%   0.00%   0.44us       4.32us    0.73us ( +-  2.57% )
 PAUSE_INSTRUCTION        58     0.04%   0.00%   0.32us      18.55us    1.48us ( +- 23.12% )
 MSR_READ                 35     0.03%   0.00%   0.78us       5.55us    2.07us ( +- 10.74% )
 CPUID                    24     0.02%   0.00%   0.27us       2.20us    0.59us ( +- 13.43% )

# careful: despite the verifier promise that eBPF shouldn't be kernel
# crashing, this may be kernel crashing because there's no verifier at
# all that verifies that the eBPF function calls available depending
# on the hooking point can actually be invoked from the kernel hooking
# points they're invoked from.
This is why I tested it in a VM:

	bpftrace -e 'kprobe:*interrupt* { @ = count() }'

Other example with a pipe loop that just bounces a byte across a pipe
with two processes:

 VM-EXIT             Samples  Samples%   Time%  Min Time     Max Time     Avg time
 MSR_WRITE            498945    80.49%   4.10%   0.33us      42.73us    0.44us ( +-  0.12% )
 HLT                  118474    19.11%  95.88%   0.33us  707693.05us   43.56us ( +- 24.23% )
 PREEMPTION_TIMER       1004     0.16%   0.01%   0.38us      25.47us    0.67us ( +-  5.69% )
 PENDING_INTERRUPT       894     0.14%   0.01%   0.37us      20.98us    0.49us ( +-  4.94% )
 EXTERNAL_INTERRUPT      518     0.08%   0.00%   0.26us      20.59us    0.51us ( +-  8.09% )
 MSR_READ                  8     0.00%   0.00%   0.66us       1.37us    0.92us ( +-  9.19% )
 EPT_MISCONFIG             6     0.00%   0.00%   3.18us      32.71us   12.60us ( +- 43.58% )
 PAUSE_INSTRUCTION         3     0.00%   0.00%   0.59us       1.69us    1.07us ( +- 30.38% )

We wouldn't need the cpuidle-haltpoll driver if HLT wasn't such a
frequent vmexit that its retpoline cost, multiplied by 100000 times per
second, deserves to be avoided. Over time, if everything turns out to
use the cpuidle-haltpoll driver by default (which however can increase
the host CPU usage on laptops), we can consider removing the HLT
optimization; we're not remotely there yet.

> RDMSR again shouldn't be there; guests sometimes read the PMTimer (which
> is an I/O port) or TSC but for example do not really ever read the APIC
> TMCCT.

We can try to drop RDMSR and see if it's measurable. I already tried to
re-add some of those retpolines but it was slower, and this was the
fastest combination that I got. I don't recall if I tried with RDMSR and
PAUSE alone but I can try again.

> > I'm pretty sure HLT/EXTERNAL_INTERRUPT/PENDING_INTERRUPT should be
> > included.
> > I also wonder if VMCALL should be added: certain loads hit on fairly
> > frequent VMCALL, but none of the ones I benchmarked.
>
> I agree for external interrupt and pending interrupt, and VMCALL is fine
> too. In addition I'd add I/O instructions, which are useful for some
> guests and also for benchmarking (e.g. vmexit.flat has both IN and OUT
> tests).
Isn't it faster to use cpuid for benchmarking? I mean, we don't want to
pay for more than one branch for benchmarking (even cpuid is
questionable in the long term, but for now it's handy to have), and
unlike inb/outb, cpuid runs occasionally in all real life workloads
(including in guest userland). So between inb/outb and cpuid, I'd prefer
cpuid as the benchmark vector because at least it has a chance to help
real workloads a bit too.

Thanks,
Andrea
On 15/10/19 22:35, Andrea Arcangeli wrote:
> On Tue, Oct 15, 2019 at 09:46:58PM +0200, Paolo Bonzini wrote:
>> On 15/10/19 18:49, Andrea Arcangeli wrote:
>>> On Tue, Oct 15, 2019 at 10:28:39AM +0200, Paolo Bonzini wrote:
>>>> If you're including EXIT_REASON_EPT_MISCONFIG (MMIO access) then you
>>>> should include EXIT_REASON_IO_INSTRUCTION too.  Depending on the devices
>>>> that are in the guest, the doorbell register might be MMIO or PIO.
>>>
>>> The fact outb/inb devices exist isn't the question here. The question
>>> you should clarify is: which of the PIO devices is as performance
>>> critical as MMIO with virtio/vhost?
>>
>> virtio 0.9 uses PIO.
>
> 0.9 is a 12 years old protocol replaced several years ago.

Oh come on.  0.9 is not 12-years old.  virtio 1.0 is 3.5 years old
(March 2016).  Anything older than 2017 is going to use 0.9.

> Your idea that HLT is certainly a slow path is only correct if
> you assume the host is idle, but the host is never idle if you use
> virt for consolidation.
>
> I've several workloads, including eBPF tracing, not related to
> interrupts (which in turn cannot be mitigated by NAPI) that schedule
> frequently and hit 100k+ HLT vmexits per second, and the host is all
> but idle. There's no need of a hardware interrupt to wake up tasks and
> schedule in the guest; scheduler IPIs and timers are more than enough.
>
> All that matters is how many vmexits per second there are; everything
> else, including "why" they happen and what those vmexits mean for the
> guest, is irrelevant, or it would be relevant only if the host was
> guaranteed to be idle, but there's no such guarantee.

Your tables give:

        Samples  Samples%   Time%  Min Time     Max time     Avg time
 HLT     101128    75.33%  99.66%   0.43us  901000.66us  310.88us
 HLT     118474    19.11%  95.88%   0.33us  707693.05us   43.56us

If "avg time" means the average time to serve an HLT vmexit, I don't
understand how you can have an average time of 0.3ms (1/3000th of a
second) and 100000 samples per second.
Can you explain that to me?

Anyway, if the average time is indeed 310us and 43us, it is orders of
magnitude more than the time spent executing a retpoline.  That time
will be spent in an indirect branch miss (retpoline) instead of doing
while(!kvm_vcpu_check_block()), but it doesn't change anything.

>>> I'm pretty sure HLT/EXTERNAL_INTERRUPT/PENDING_INTERRUPT should be
>>> included.
>>> I also wonder if VMCALL should be added: certain loads hit on fairly
>>> frequent VMCALL, but none of the ones I benchmarked.
>>
>> I agree for external interrupt and pending interrupt, and VMCALL is fine
>> too. In addition I'd add I/O instructions, which are useful for some
>> guests and also for benchmarking (e.g. vmexit.flat has both IN and OUT
>> tests).
>
> Isn't it faster to use cpuid for benchmarking? I mean, we don't want to
> pay for more than one branch for benchmarking (even cpuid is
> questionable in the long term, but for now it's handy to have),

outl is more or less the same as cpuid and vmcall.  You can measure it
with vmexit.flat.  inl is slower.

> and unlike inb/outb, cpuid runs occasionally in all real life workloads
> (including in guest userland). So between inb/outb and cpuid, I'd prefer
> cpuid as the benchmark vector because at least it has a chance
> to help real workloads a bit too.

Again: what is the real workload that does thousands of CPUIDs per second?

Paolo
On Wed, Oct 16, 2019 at 12:22:31AM +0200, Paolo Bonzini wrote:
> Oh come on.  0.9 is not 12-years old.  virtio 1.0 is 3.5 years old
> (March 2016).  Anything older than 2017 is going to use 0.9.

Sorry if I got the date wrong, but I still don't see the point in
optimizing for legacy virtio. I can't justify forcing everyone to
execute that additional branch for inb/outb in the attempt to make
legacy virtio faster, which nobody should use in combination with
bleeding edge KVM in the host.

> Your tables give:
>
>         Samples  Samples%   Time%  Min Time     Max time     Avg time
>  HLT     101128    75.33%  99.66%   0.43us  901000.66us  310.88us
>  HLT     118474    19.11%  95.88%   0.33us  707693.05us   43.56us
>
> If "avg time" means the average time to serve an HLT vmexit, I don't
> understand how you can have an average time of 0.3ms (1/3000th of a
> second) and 100000 samples per second. Can you explain that to me?

I described it wrong: the bpftrace record was a sleep 5, not a sleep 1.
The pipe loop was for sure a sleep 1. I just wanted to show how even in
loads where you wouldn't even expect to get HLT, like the bpftrace one
that is pure guest CPU load, you still get 100k of them (over 5
seconds). The issue is that in production you get a flood more of those
with hundreds of CPUs, so the exact number doesn't move the needle.

> Anyway, if the average time is indeed 310us and 43us, it is orders of
> magnitude more than the time spent executing a retpoline.  That time
> will be spent in an indirect branch miss (retpoline) instead of doing
> while(!kvm_vcpu_check_block()), but it doesn't change anything.

Doesn't cpuidle haltpoll disable that loop? Ideally there should be
fewer HLT vmexits then, but I don't know how many fewer. This just
needs to be frequent enough that the branch cost pays for itself, but
the sure thing is that HLT vmexits will not go away unless you execute
mwait in guest mode by isolating the CPU in the host.

> Again: what is the real workload that does thousands of CPUIDs per second?
None, but there are always background CPUID vmexits while there are
never inb/outb vmexits.

So the cpuid retpoline removal has a slight chance to pay for the cost
of the branch, while the inb/outb retpoline removal cannot pay off the
cost of the branch. This is why I prefer cpuid as a benchmark gadget in
the short term, unless inb/outb offers other benchmark-related
benefits.

Thanks,
Andrea
On 16/10/19 01:42, Andrea Arcangeli wrote:
> On Wed, Oct 16, 2019 at 12:22:31AM +0200, Paolo Bonzini wrote:
>> Oh come on.  0.9 is not 12-years old.  virtio 1.0 is 3.5 years old
>> (March 2016).  Anything older than 2017 is going to use 0.9.
>
> Sorry if I got the date wrong, but I still don't see the point in
> optimizing for legacy virtio. I can't justify forcing everyone to
> execute that additional branch for inb/outb in the attempt to make
> legacy virtio faster, which nobody should use in combination with
> bleeding edge KVM in the host.

Yet you would add CPUID to the list even though it is not even there in
your benchmarks, and is *never* invoked in a hot path by *any* sane
program?  Some OSes have never gotten virtio 1.0 drivers.  OpenBSD only
got it earlier this year.

>> If "avg time" means the average time to serve an HLT vmexit, I don't
>> understand how you can have an average time of 0.3ms (1/3000th of a
>> second) and 100000 samples per second. Can you explain that to me?
>
> I described it wrong: the bpftrace record was a sleep 5, not a sleep
> 1. The pipe loop was for sure a sleep 1.

It still doesn't add up.  0.3ms / 5 is 1/15000th of a second; 43us is
1/25000th of a second.  Do you have multiple vCPUs perhaps?

> The issue is that in production you get a flood more of those with
> hundreds of CPUs, so the exact number doesn't move the needle.
> This just needs to be frequent enough that the branch cost pays for
> itself, but the sure thing is that HLT vmexits will not go away unless
> you execute mwait in guest mode by isolating the CPU in the host.

The number of vmexits doesn't count (for HLT).  What counts is how long
they take to be serviced, and as long as it's 1us or more the
optimization is pointless.
Consider these pictures:

           w/o optimization            with optimization
           ----------------------      -------------------------
   0us     vmexit                      vmexit
   500ns   retpoline                   call vmexit handler directly
   600ns   retpoline                   kvm_vcpu_check_block()
   700ns   retpoline                   kvm_vcpu_check_block()
   800ns   kvm_vcpu_check_block()      kvm_vcpu_check_block()
   900ns   kvm_vcpu_check_block()      kvm_vcpu_check_block()
   ...
   39900ns kvm_vcpu_check_block()      kvm_vcpu_check_block()

           <interrupt arrives>

   40000ns kvm_vcpu_check_block()      kvm_vcpu_check_block()

Unless the interrupt arrives exactly in the few nanoseconds that it
takes to execute the retpoline, a direct handling of HLT vmexits makes
*absolutely no difference*.

>> Again: what is the real workload that does thousands of CPUIDs per second?
>
> None, but there are always background CPUID vmexits while there are
> never inb/outb vmexits.
>
> So the cpuid retpoline removal has a slight chance to pay for the cost
> of the branch, while the inb/outb retpoline removal cannot pay off the
> cost of the branch.

Please stop considering only the exact configuration of your
benchmarks.  There are known, valid configurations where outb is a very
hot vmexit.

Thanks,

Paolo
On Wed, Oct 16, 2019 at 09:07:39AM +0200, Paolo Bonzini wrote:
> Yet you would add CPUID to the list even though it is not even there in
> your benchmarks, and is *never* invoked in a hot path by *any* sane

I justified CPUID as a "short term" benchmark gadget; it's one of those
it shouldn't be a problem at all to remove, and I couldn't possibly be
against removing it. I only pointed out the fact that cpuid on any
modern Linux guest is going to run more frequently than any inb/outb,
so if I had to pick a benchmark gadget, that remains my favorite one.

> program?  Some OSes have never gotten virtio 1.0 drivers.  OpenBSD only
> got it earlier this year.

If the target is an optimization for a cranky OS that can't upgrade
virtio to obtain the full performance benefit from the retpoline
removal too (I don't know the specifics from just reading the above),
then it's a better argument. At least it sounds fair enough not to
unfairly penalize the cranky OS forced to run obsolete protocols that
nobody can update or has the time to update. I mean, until you said
there's some OS that cannot upgrade to virtio 1.0, I thought it was
perfectly fine to say "if you want to run a guest with the full benefit
of virtio 1.0 on KVM, you should upgrade to virtio 1.0 and not stick to
whatever 3 year old protocol; then also the inb/outb retpoline will go
away if you upgrade the host, because the inb/outb will go away in the
first place".

> It still doesn't add up.  0.3ms / 5 is 1/15000th of a second; 43us is
> 1/25000th of a second.  Do you have multiple vCPUs perhaps?

Why would I run any test on UP guests? Rather than spending time doing
the math on my results, it's probably quicker that you run it yourself:

https://lkml.kernel.org/r/20190109034941.28759-1-aarcange@redhat.com/

Marcelo should have better reproducers for frequent HLT, which is a
real workload we have to pass; I reported the first two random things I
had around that showed fairly frequent HLT. The pipe loop load is
similar to local network I/O.
> The number of vmexits doesn't count (for HLT).  What counts is how long
> they take to be serviced, and as long as it's 1us or more the
> optimization is pointless.
>
> Consider these pictures:
>
>            w/o optimization            with optimization
>            ----------------------      -------------------------
>    0us     vmexit                      vmexit
>    500ns   retpoline                   call vmexit handler directly
>    600ns   retpoline                   kvm_vcpu_check_block()
>    700ns   retpoline                   kvm_vcpu_check_block()
>    800ns   kvm_vcpu_check_block()      kvm_vcpu_check_block()
>    900ns   kvm_vcpu_check_block()      kvm_vcpu_check_block()
>    ...
>    39900ns kvm_vcpu_check_block()      kvm_vcpu_check_block()
>
>            <interrupt arrives>
>
>    40000ns kvm_vcpu_check_block()      kvm_vcpu_check_block()
>
> Unless the interrupt arrives exactly in the few nanoseconds that it
> takes to execute the retpoline, a direct handling of HLT vmexits makes
> *absolutely no difference*.

You keep focusing on what happens if the host is completely idle (in
which case guest HLT is a slow path) and you keep ignoring the case
where the host isn't completely idle (in which case guest HLT is not a
slow path).

Please note the single_task_running() check, which immediately breaks
the kvm_vcpu_check_block() loop if there's even a single other task
that can be scheduled in the runqueue of the host CPU.

What happens when the host is not idle is quoted below:

           w/o optimization            with optimization
           ----------------------      -------------------------
   0us     vmexit                      vmexit
   500ns   retpoline                   call vmexit handler directly
   600ns   retpoline                   kvm_vcpu_check_block()
   700ns   retpoline                   schedule()
   800ns   kvm_vcpu_check_block()
   900ns   schedule()
   ...

Disclaimer: the numbers on the left are arbitrary; I just cut and
pasted them from yours, no idea how far off they are.
To be clear, I would find it very reasonable to be asked to prove the
benefit of the HLT optimization with benchmarks specific to that single
one liner, but until then, the idea that we can drop the retpoline
optimization from the HLT vmexit by just thinking about it still
doesn't make sense to me, because by thinking about it I come to the
opposite conclusion.

The lack of single_task_running() in the guest driver is also why the
guest cpuidle haltpoll risks wasting some CPU with host overcommit or
with the host loaded at full capacity, and why we may not assume it to
be universally enabled.

Thanks,
Andrea
On 16/10/19 18:50, Andrea Arcangeli wrote:
>> It still doesn't add up.  0.3ms / 5 is 1/15000th of a second; 43us is
>> 1/25000th of a second.  Do you have multiple vCPUs perhaps?
>
> Why would I run any test on UP guests? Rather than spending time doing
> the math on my results, it's probably quicker that you run it yourself:

I don't know, but if you don't say how many vCPUs you have, I cannot do
the math and review the patch.

>> The number of vmexits doesn't count (for HLT).  What counts is how long
>> they take to be serviced, and as long as it's 1us or more the
>> optimization is pointless.
>
> Please note the single_task_running() check, which immediately breaks
> the kvm_vcpu_check_block() loop if there's even a single other task
> that can be scheduled in the runqueue of the host CPU.
>
> What happens when the host is not idle is quoted below:
>
>            w/o optimization            with optimization
>            ----------------------      -------------------------
>    0us     vmexit                      vmexit
>    500ns   retpoline                   call vmexit handler directly
>    600ns   retpoline                   kvm_vcpu_check_block()
>    700ns   retpoline                   schedule()
>    800ns   kvm_vcpu_check_block()
>    900ns   schedule()
>    ...
>
> Disclaimer: the numbers on the left are arbitrary; I just cut and
> pasted them from yours, no idea how far off they are.

Yes, of course.  But the idea is the same: yes, because of the retpoline
you run the guest for perhaps 300ns more before schedule()ing, but does
that really matter?  300ns * 20000 times/second is a 0.6% performance
impact, and 300ns is already very generous.  I am not sure it would be
measurable at all.

Paolo
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index de3ae2246205..2bd57a7d2be1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5846,9 +5846,29 @@ int kvm_x86_handle_exit(struct kvm_vcpu *vcpu)
 	}
 
 	if (exit_reason < kvm_vmx_max_exit_handlers
-	    && kvm_vmx_exit_handlers[exit_reason])
+	    && kvm_vmx_exit_handlers[exit_reason]) {
+#ifdef CONFIG_RETPOLINE
+		if (exit_reason == EXIT_REASON_MSR_WRITE)
+			return kvm_emulate_wrmsr(vcpu);
+		else if (exit_reason == EXIT_REASON_PREEMPTION_TIMER)
+			return handle_preemption_timer(vcpu);
+		else if (exit_reason == EXIT_REASON_PENDING_INTERRUPT)
+			return handle_interrupt_window(vcpu);
+		else if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
+			return handle_external_interrupt(vcpu);
+		else if (exit_reason == EXIT_REASON_HLT)
+			return kvm_emulate_halt(vcpu);
+		else if (exit_reason == EXIT_REASON_PAUSE_INSTRUCTION)
+			return handle_pause(vcpu);
+		else if (exit_reason == EXIT_REASON_MSR_READ)
+			return kvm_emulate_rdmsr(vcpu);
+		else if (exit_reason == EXIT_REASON_CPUID)
+			return kvm_emulate_cpuid(vcpu);
+		else if (exit_reason == EXIT_REASON_EPT_MISCONFIG)
+			return handle_ept_misconfig(vcpu);
+#endif
 		return kvm_vmx_exit_handlers[exit_reason](vcpu);
-	else {
+	} else {
 		vcpu_unimpl(vcpu, "vmx: unexpected exit reason 0x%x\n",
 			    exit_reason);
 		dump_vmcs();
It's enough to check the exit value and issue a direct call to avoid
the retpoline for all the common vmexit reasons.

Reducing this list to only EXIT_REASON_MSR_WRITE,
EXIT_REASON_PREEMPTION_TIMER, EXIT_REASON_EPT_MISCONFIG,
EXIT_REASON_IO_INSTRUCTION increases the computation time of the
hrtimer guest testcase on Haswell i5-4670T CPU @ 2.30GHz by 7% with
the default spectre v2 mitigation enabled in the host and guest. On
skylake, by contrast, there's no measurable difference with the short
list. To put things in perspective, on Haswell the same hrtimer
workload (note: it never calls cpuid and it never attempts to trigger
more vmexits on purpose) in guest takes 16.3% longer to compute on
upstream KVM running in the host than with the KVM mono v1 patchset
applied to the host kernel, while on skylake the same takes only 5.4%
more time (both with the default mitigations enabled in guest and
host).

It's also unclear why EXIT_REASON_IO_INSTRUCTION should be included.

Of course CONFIG_RETPOLINE already forbids gcc from emitting indirect
jumps while compiling all switch() statements, but switch() would still
allow the compiler to bisect the value, and that seems to run slower if
anything; the reason is that it's better to prioritize and do the
minimal possible number of checks for the most common vmexits.

The halt and pause loop exiting may be slow paths from the point of
view of the guest, but not necessarily so from the point of view of the
host. There can be a flood of halt exit reasons (in fact that's why the
cpuidle guest haltpoll support was recently merged, and we can't rely
on it here because there are older kernels and other OSes that must
also perform optimally). All it takes is a pipe ping pong with a
different host CPU and the host CPUs running at full capacity. The same
consideration applies to the pause loop exiting exit reason: if there's
heavy host overcommit that collides heavily in a spinlock, the same may
happen.
In the common case of a fully idle host, the halt and pause loop
exiting can't help, but adding them doesn't hurt the common case and
the expectation here is that if they would ever become measurable, it
would be because they are increasing (and not decreasing) performance.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)