[v3,0/4] KVM: LAPIC: Implement Exitless Timer

Message ID 1560255429-7105-1-git-send-email-wanpengli@tencent.com (mailing list archive)

Message

Wanpeng Li June 11, 2019, 12:17 p.m. UTC
Dedicated instances currently suffer unnecessary jitter because the
emulated lapic timers fire on the same pCPUs where the vCPUs reside.
Unlike ARM, Intel has no hardware virtual timer for the guest, so both
programming the timer in the guest and the firing of the emulated timer
incur vmexits. This patchset tries to avoid the vmexit incurred when the
emulated timer fires in the dedicated instance scenario.

When nohz_full is enabled in the dedicated instance scenario, unpinned
timers are moved to the nearest busy housekeeper after commit 444969223c8
("sched/nohz: Fix affine unpinned timers mess"). However, KVM always pins
the lapic timer to the pCPU where the vCPU resides; the reason is
explained by commit 61abdbe0 ("kvm: x86: make lapic hrtimer pinned").
In practice, these emulated timers can be offloaded to the housekeeping
cpus, since APICv has become common in recent years. Once the emulated
timer fires, the guest timer interrupt is injected by a posted interrupt
delivered from the housekeeping cpu.

The host admin should tune the host accordingly, e.g. for the dedicated
instance scenario: nohz_full covering the pCPUs where the vCPUs reside,
several spare pCPUs for housekeeping, and mwait/hlt/pause vmexits
disabled so that the vCPUs keep their pCPUs occupied. Fortunately, the
preemption timer is disabled once mwait is exposed to the guest, which
makes offloading the emulated timer possible.
A ~3% redis performance benefit can be observed on a Skylake server.

w/o patchset:

            VM-EXIT    Samples  Samples%     Time%   Min Time   Max Time   Avg time

 EXTERNAL_INTERRUPT      42916    49.43%    39.30%     0.47us   106.09us     0.71us ( +-   1.09% )

w/ patchset:

            VM-EXIT    Samples  Samples%     Time%   Min Time   Max Time   Avg time

 EXTERNAL_INTERRUPT       6871     9.29%     2.96%     0.44us    57.88us     0.72us ( +-   4.02% )


v2 -> v3:
 * disarming the vmx preemption timer when posted_interrupt_inject_timer_enabled()
 * check kvm_hlt_in_guest instead

v1 -> v2:
 * check vcpu_halt_in_guest
 * move module parameter from kvm-intel to kvm
 * add housekeeping_enabled
 * rename apic_timer_expired_pi to kvm_apic_inject_pending_timer_irqs


Wanpeng Li (4):
  KVM: LAPIC: Make lapic timer unpinned when timer is injected by pi
  KVM: LAPIC: lapic timer interrupt is injected by posted interrupt
  KVM: LAPIC: Ignore timer migration when lapic timer is injected by pi
  KVM: LAPIC: add advance timer support to pi_inject_timer

 arch/x86/kvm/lapic.c            | 61 +++++++++++++++++++++++++++++++----------
 arch/x86/kvm/lapic.h            |  3 +-
 arch/x86/kvm/svm.c              |  2 +-
 arch/x86/kvm/vmx/vmx.c          |  5 ++--
 arch/x86/kvm/x86.c              |  5 ++++
 arch/x86/kvm/x86.h              |  2 ++
 include/linux/sched/isolation.h |  2 ++
 kernel/sched/isolation.c        |  6 ++++
 8 files changed, 68 insertions(+), 18 deletions(-)

Comments

Maxim Levitsky June 13, 2019, 7:59 a.m. UTC | #1
I don't yet know KVM well enough to review this patch series, but overall I really like the idea.
I researched this area some time ago to see what can be done to reduce the number of vmexits
to an absolute minimum.

I have one small question, just out of curiosity.

Why do you require mwait in the guest to be enabled? 

If I understand it correctly, you say
that when mwait in the guest is disabled, then vmx preemption timer will be used,
and thus it will handle the apic timer?

Best regards,
	Maxim Levitsky

Wanpeng Li June 13, 2019, 8:25 a.m. UTC | #2
On Thu, 13 Jun 2019 at 15:59, Maxim Levitsky <mlevitsk@redhat.com> wrote:
> I don't yet know KVM well enough to review this patch series, but overall I really like the idea.

Thank you. :)

> I researched this area some time ago to see what can be done to reduce the number of vmexits
> to an absolute minimum.
>
> I have one small question, just out of curiosity.
>
> Why do you require mwait in the guest to be enabled?
>
> If I understand it correctly, you say
> that when mwait in the guest is disabled, then vmx preemption timer will be used,
> and thus it will handle the apic timer?

Actually, we don't have this restriction in v3; the patchset
description needs to be updated. The lapic timer that the guest uses
can be emulated either by software (an hrtimer on the host) or by VT-x
hardware (the VMX preemption timer). The VMX preemption timer triggers
a vmexit when it fires, on the same pCPU that the vCPU is running on,
so the injection vmexit can't be avoided. The hrtimer on the host is
used to emulate the lapic timer when the VMX preemption timer is
disabled. After commit 9642d18eee2cd ("nohz: Affine unpinned timers to
housekeepers"), unpinned timers are moved to the nearest busy
housekeeper, which means that we can offload the hrtimer to the
housekeeping cpus instead of running it on the pCPU where the vCPU
resides. The timer then fires on a housekeeping cpu and is injected
into the vCPU by posted interrupt without incurring a vmexit. In
patchset v3, the preemption timer is disarmed if the lapic timer is
injected by posted interrupt. The VMX preemption timer stops working
in C-states deeper than C2, which is why I relied on exposing mwait
before (commit 386c6ddbda1 "X86/VMX: Disable VMX preemption timer if
MWAIT is not intercepted").

Regards,
Wanpeng Li

Maxim Levitsky June 13, 2019, 9:49 a.m. UTC | #3
That explains it very well. Thank you!

Best regards,
	Maxim Levitsky