Message ID | 20200429093634.1514902-5-vkuznets@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: x86: Interrupt-based mechanism for async_pf 'page present' notifications | expand |
On Wed, Apr 29, 2020 at 2:36 AM Vitaly Kuznetsov <vkuznets@redhat.com> wrote:
>
> If two page ready notifications happen back to back the second one is not
> delivered and the only mechanism we currently have is the
> kvm_check_async_pf_completion() check in the vcpu_run() loop. The check
> will only be performed with the next vmexit, whenever it happens, and in
> some cases it may take a while. With interrupt based page ready
> notification delivery the situation is even worse: unlike exceptions,
> interrupts are not handled immediately so we must check if the slot is
> empty. This is slow and unnecessary. Introduce a dedicated
> MSR_KVM_ASYNC_PF_ACK MSR to communicate the fact that the slot is free
> and the host should check its notification queue. Mandate using it for
> interrupt based type 2 APF event delivery.

This seems functional, but I'm wondering if it could be a bit simpler and
more efficient if the data structure was a normal descriptor ring with the
same number of slots as whatever the maximum number of waiting pages is.
Then there would never need to be any notification from the guest back to
the host, since there would always be room for a notification.

It might be even better if a single unified data structure was used for
both notifications.
On 29/04/20 19:28, Andy Lutomirski wrote:
> This seems functional, but I'm wondering if it could be a bit simpler and
> more efficient if the data structure was a normal descriptor ring with
> the same number of slots as whatever the maximum number of waiting pages
> is. Then there would never need to be any notification from the guest
> back to the host, since there would always be room for a notification.

No, it would be much more complicated code for a slow path which is
already orders of magnitude slower than a vmexit. It would also use much
more memory.

> It might be even better if a single unified data structure was used
> for both notifications.

That's a very bad idea since one is synchronous and one is asynchronous.
Part of the proposal we agreed upon was to keep "page not ready"
synchronous while making "page ready" an interrupt. The data structure
for "page not ready" will be #VE.

Paolo
On Wed, Apr 29, 2020 at 10:40 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 29/04/20 19:28, Andy Lutomirski wrote:
> > This seems functional, but I'm wondering if it could be a bit simpler
> > and more efficient if the data structure was a normal descriptor ring
> > with the same number of slots as whatever the maximum number of waiting
> > pages is. Then there would never need to be any notification from the
> > guest back to the host, since there would always be room for a
> > notification.
>
> No, it would be much more complicated code for a slow path which is
> already orders of magnitude slower than a vmexit. It would also use
> much more memory.

Fair enough.

> > It might be even better if a single unified data structure was used
> > for both notifications.
>
> That's a very bad idea since one is synchronous and one is asynchronous.
> Part of the proposal we agreed upon was to keep "page not ready"
> synchronous while making "page ready" an interrupt. The data structure
> for "page not ready" will be #VE.

#VE on SVM will be interesting, to say the least, and I think that a
solution that is VMX specific doesn't make much sense. #VE also has
unpleasant issues involving the contexts in which it can occur. You will
have quite a hard time convincing me to ack the addition of a #VE entry
handler for this. I think a brand new vector is the right solution.
On 30/04/20 02:45, Andy Lutomirski wrote:
>> That's a very bad idea since one is synchronous and one is asynchronous.
>> Part of the proposal we agreed upon was to keep "page not ready"
>> synchronous while making "page ready" an interrupt. The data structure
>> for "page not ready" will be #VE.
>
> #VE on SVM will be interesting, to say the least, and I think that a
> solution that is VMX specific doesn't make much sense.

You can always inject it manually. The same is true of Haswell and
earlier processors.

> #VE also has unpleasant issues involving the contexts in which it can
> occur. You will have quite a hard time convincing me to ack the
> addition of a #VE entry handler for this. I think a brand new vector
> is the right solution.

I need --verbose. :) For #VE I liked the idea of re-enabling it from an
IPI, at least in the case where we cannot move out of the IST stack. And
any other vector that behaves like an exception would have the same
issue, wouldn't it (especially re-entrancy)?

Paolo
On 29/04/20 11:36, Vitaly Kuznetsov wrote:
> +    case MSR_KVM_ASYNC_PF_ACK:
> +        if (data & 0x1)
> +            kvm_check_async_pf_completion(vcpu);
> +        break;

Does this work if interrupts are turned off? I think in that case
kvm_check_async_pf_completion will refuse to make progress. You need to
make this bit stateful (e.g. 1 = async PF in progress, 0 = not in
progress), and check that for page ready notifications instead of
EFLAGS.IF. This probably means that:

- it might be simpler to move it to the vector MSR

- it's definitely much simpler to remove the #PF-based mechanism for
  injecting page ready notifications.

Thanks,

Paolo
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 29/04/20 11:36, Vitaly Kuznetsov wrote:
>> +    case MSR_KVM_ASYNC_PF_ACK:
>> +        if (data & 0x1)
>> +            kvm_check_async_pf_completion(vcpu);
>> +        break;
>
> Does this work if interrupts are turned off? I think in that case
> kvm_check_async_pf_completion will refuse to make progress. You need to
> make this bit stateful (e.g. 1 = async PF in progress, 0 = not in
> progress), and check that for page ready notifications instead of
> EFLAGS.IF. This probably means that:
>
> - it might be simpler to move it to the vector MSR

I didn't want to merge 'ACK' with the vector MSR as it forces the guest
to remember the setting. It doesn't matter at all for Linux as we
hardcode the interrupt number, but I can imagine an OS assigning IRQ
numbers dynamically; it'll need to keep a record to avoid doing rdmsr.

> - it's definitely much simpler to remove the #PF-based mechanism for
>   injecting page ready notifications.

Yea, the logic in kvm_can_do_async_pf()/kvm_can_deliver_async_pf()
becomes cumbersome. If we are to drop the #PF-based mechanism I'd split
it completely from the remaining synchronous #PF for page-not-present:
basically, we only need to check that the slot (which we agreed becomes
completely separate) is empty; interrupt/pending exception/... state
becomes irrelevant.
On 30/04/20 10:40, Vitaly Kuznetsov wrote:
>> I think in that case kvm_check_async_pf_completion will refuse to make
>> progress. You need to make this bit stateful (e.g. 1 = async PF in
>> progress, 0 = not in progress), and check that for page ready
>> notifications instead of EFLAGS.IF. This probably means that:
>>
>> - it might be simpler to move it to the vector MSR
>
> I didn't want to merge 'ACK' with the vector MSR as it forces the guest
> to remember the setting. It doesn't matter at all for Linux as we
> hardcode the interrupt number, but I can imagine an OS assigning IRQ
> numbers dynamically; it'll need to keep a record to avoid doing rdmsr.

I would expect that it needs to keep it in a global variable anyway, but
yes this is a good point. You can also keep the ACK MSR and store the
pending bit in the other MSR, kind of like you have separate ISR and EOI
registers in the LAPIC.

>> - it's definitely much simpler to remove the #PF-based mechanism for
>>   injecting page ready notifications.
>
> Yea, the logic in kvm_can_do_async_pf()/kvm_can_deliver_async_pf()
> becomes cumbersome. If we are to drop the #PF-based mechanism I'd split
> it completely from the remaining synchronous #PF for page-not-present:
> basically, we only need to check that the slot (which we agreed becomes
> completely separate) is empty; interrupt/pending exception/... state
> becomes irrelevant.

Yes, that's a good point.

Paolo
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 30/04/20 10:40, Vitaly Kuznetsov wrote:
>>> I think in that case kvm_check_async_pf_completion will refuse to
>>> make progress. You need to make this bit stateful (e.g. 1 = async PF
>>> in progress, 0 = not in progress), and check that for page ready
>>> notifications instead of EFLAGS.IF. This probably means that:
>>>
>>> - it might be simpler to move it to the vector MSR
>> I didn't want to merge 'ACK' with the vector MSR as it forces the guest
>> to remember the setting. It doesn't matter at all for Linux as we
>> hardcode the interrupt number, but I can imagine an OS assigning IRQ
>> numbers dynamically; it'll need to keep a record to avoid doing rdmsr.
>
> I would expect that it needs to keep it in a global variable anyway, but
> yes this is a good point. You can also keep the ACK MSR and store the
> pending bit in the other MSR, kind of like you have separate ISR and EOI
> registers in the LAPIC.

Honestly I was inspired by Hyper-V's HV_X64_MSR_EOM MSR, as the protocol
we're trying to come up with here is very similar to HV messaging :)

I'm not exactly sure why we need the pending bit after we drop #PF. When
we call kvm_check_async_pf_completion() from the MSR_KVM_ASYNC_PF_ACK
write, it will (in case there are page ready events in the queue) check
if the slot is empty, put one there and raise an IRQ regardless of the
guest's current state. It may or may not get injected immediately but we
don't care. The second invocation of kvm_check_async_pf_completion()
from vcpu_run() will just go away.

I'm probably just missing something, will think of it again while
working on v1; it seems nobody is against the idea in general. Thanks!
On 30/04/20 13:33, Vitaly Kuznetsov wrote:
>> I would expect that it needs to keep it in a global variable anyway, but
>> yes this is a good point. You can also keep the ACK MSR and store the
>> pending bit in the other MSR, kind of like you have separate ISR and EOI
>> registers in the LAPIC.
>
> Honestly I was inspired by Hyper-V's HV_X64_MSR_EOM MSR, as the protocol
> we're trying to come up with here is very similar to HV messaging :)

Oh, that's true actually.

> I'm not exactly sure why we need the pending bit after we drop #PF. When
> we call kvm_check_async_pf_completion() from the MSR_KVM_ASYNC_PF_ACK
> write, it will (in case there are page ready events in the queue) check
> if the slot is empty, put one there and raise an IRQ regardless of the
> guest's current state. It may or may not get injected immediately but we
> don't care. The second invocation of kvm_check_async_pf_completion()
> from vcpu_run() will just go away.

You're right, you can just use the value in the guest to see if the
guest is ready. This is also similar to how #VE handles re-entrancy;
however, because this is an interrupt, we have IF to delay the IRQ until
after the interrupt handler has finished.

By dropping the #PF page ready case, we can also drop the ugly case
where WRMSR injects a page ready page fault even if IF=0. That one is
safe on Linux, but Andy didn't like it.

Paolo
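The guest-side sequence discussed above (consume the token, clear 'reason' to free the slot, then write bit 0 of the ACK MSR) can be sketched as a small userspace model. Everything here is an illustrative stand-in: the struct mirrors only two fields of the real 'struct kvm_vcpu_pv_apf_data', and wrmsr() is a stub that just records the write instead of executing WRMSR.

```c
/* Hedged sketch of the guest side of the proposed ACK protocol.
 * The wrmsr() below is a recording stub; on real hardware this
 * would be a WRMSR to MSR 0x4b564d07. */
#include <assert.h>
#include <stdint.h>

#define MSR_KVM_ASYNC_PF_ACK 0x4b564d07

/* Simplified model of the shared per-vCPU APF data. */
struct apf_data_model {
	uint32_t reason;	/* 0 = slot empty */
	uint32_t token;
};

static uint32_t last_msr;
static uint64_t last_val;

static void wrmsr(uint32_t msr, uint64_t val)
{
	last_msr = msr;
	last_val = val;
}

/* Interrupt handler: read the token, clear 'reason' FIRST so the slot
 * is genuinely free, and only then tell the host via the ACK MSR. */
static uint32_t apf_page_ready_irq(struct apf_data_model *d)
{
	uint32_t token = d->token;

	d->reason = 0;
	wrmsr(MSR_KVM_ASYNC_PF_ACK, 1);	/* bit 0: "slot free, re-scan" */
	return token;
}
```

The ordering matters: if the ACK were written before 'reason' was cleared, the host's re-scan would still see the slot as busy and the next notification would again be delayed.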
Hi Vitaly,

On 4/29/20 7:36 PM, Vitaly Kuznetsov wrote:
> If two page ready notifications happen back to back the second one is not
> delivered and the only mechanism we currently have is the
> kvm_check_async_pf_completion() check in the vcpu_run() loop. The check
> will only be performed with the next vmexit, whenever it happens, and in
> some cases it may take a while. With interrupt based page ready
> notification delivery the situation is even worse: unlike exceptions,
> interrupts are not handled immediately so we must check if the slot is
> empty. This is slow and unnecessary. Introduce a dedicated
> MSR_KVM_ASYNC_PF_ACK MSR to communicate the fact that the slot is free
> and the host should check its notification queue. Mandate using it for
> interrupt based type 2 APF event delivery.
>
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  Documentation/virt/kvm/msr.rst       | 16 +++++++++++++++-
>  arch/x86/include/uapi/asm/kvm_para.h |  1 +
>  arch/x86/kvm/x86.c                   |  9 ++++++++-
>  3 files changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
> index 7433e55f7184..18db3448db06 100644
> --- a/Documentation/virt/kvm/msr.rst
> +++ b/Documentation/virt/kvm/msr.rst
> @@ -219,6 +219,11 @@ data:
>      If during pagefault APF reason is 0 it means that this is regular
>      page fault.
>
> +    For interrupt based delivery, the guest has to write '1' to
> +    MSR_KVM_ASYNC_PF_ACK every time it clears 'reason' in the shared
> +    'struct kvm_vcpu_pv_apf_data'; this forces KVM to re-scan its
> +    queue and deliver the next pending notification.
> +
>      During delivery of type 1 APF cr2 contains a token that will
>      be used to notify a guest when missing page becomes
>      available. When page becomes available type 2 APF is sent with
> @@ -340,4 +345,13 @@ data:
>
>      To switch to interrupt based delivery of type 2 APF events guests
>      are supposed to enable asynchronous page faults and set bit 3 in
> -    MSR_KVM_ASYNC_PF_EN first.
> +
> +MSR_KVM_ASYNC_PF_ACK:
> +    0x4b564d07
> +
> +data:
> +    Asynchronous page fault acknowledgment. When the guest is done
> +    processing a type 2 APF event and the 'reason' field in 'struct
> +    kvm_vcpu_pv_apf_data' is cleared, it is supposed to write '1' to
> +    bit 0 of the MSR; this causes the host to re-scan its queue and
> +    check if there are more notifications pending.

I'm not sure if I understood the usage of MSR_KVM_ASYNC_PF_ACK
completely. It seems it's used to trap to the host, to have a chance to
check/deliver pending page ready events. If there are no pending events,
no work will be done in the trap. If that's true, it might be a good
idea to trap conditionally, meaning the guest writes to ASYNC_PF_ACK
only if there really are pending events?

Thanks,
Gavin

> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 1bbb0b7e062f..5c7449980619 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -51,6 +51,7 @@
>  #define MSR_KVM_PV_EOI_EN      0x4b564d04
>  #define MSR_KVM_POLL_CONTROL   0x4b564d05
>  #define MSR_KVM_ASYNC_PF2      0x4b564d06
> +#define MSR_KVM_ASYNC_PF_ACK   0x4b564d07
>
>  struct kvm_steal_time {
>      __u64 steal;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 861dce1e7cf5..e3b91ac33bfd 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1243,7 +1243,7 @@ static const u32 emulated_msrs_all[] = {
>      HV_X64_MSR_TSC_EMULATION_STATUS,
>
>      MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
> -    MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF2,
> +    MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF2, MSR_KVM_ASYNC_PF_ACK,
>
>      MSR_IA32_TSC_ADJUST,
>      MSR_IA32_TSCDEADLINE,
> @@ -2915,6 +2915,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>          if (kvm_pv_enable_async_pf2(vcpu, data))
>              return 1;
>          break;
> +    case MSR_KVM_ASYNC_PF_ACK:
> +        if (data & 0x1)
> +            kvm_check_async_pf_completion(vcpu);
> +        break;
>      case MSR_KVM_STEAL_TIME:
>
>          if (unlikely(!sched_info_on()))
> @@ -3194,6 +3198,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>      case MSR_KVM_ASYNC_PF2:
>          msr_info->data = vcpu->arch.apf.msr2_val;
>          break;
> +    case MSR_KVM_ASYNC_PF_ACK:
> +        msr_info->data = 0;
> +        break;
>      case MSR_KVM_STEAL_TIME:
>          msr_info->data = vcpu->arch.st.msr_val;
>          break;
Gavin Shan <gshan@redhat.com> writes:

> Hi Vitaly,
>
> On 4/29/20 7:36 PM, Vitaly Kuznetsov wrote:
>> +MSR_KVM_ASYNC_PF_ACK:
>> +    0x4b564d07
>> +
>> +data:
>> +    Asynchronous page fault acknowledgment. When the guest is done
>> +    processing a type 2 APF event and the 'reason' field in 'struct
>> +    kvm_vcpu_pv_apf_data' is cleared, it is supposed to write '1' to
>> +    bit 0 of the MSR; this causes the host to re-scan its queue and
>> +    check if there are more notifications pending.
>
> I'm not sure if I understood the usage of MSR_KVM_ASYNC_PF_ACK
> completely. It seems it's used to trap to the host, to have a chance to
> check/deliver pending page ready events. If there are no pending events,
> no work will be done in the trap. If that's true, it might be a good
> idea to trap conditionally, meaning the guest writes to ASYNC_PF_ACK
> only if there really are pending events?

How does the guest know if the host has any pending events or not?

The problem we're trying to address with the ACK MSR is the following:
imagine the host has two 'page ready' notifications back to back. It
puts the token for the first one in the slot and raises an IRQ, but how
do we know when the slot becomes free so we can inject the second one?
Currently, we have the kvm_check_async_pf_completion() check in the
vcpu_run() loop but this doesn't guarantee timely delivery of the event;
we just hope that there's going to be a vmexit 'some time soon' and
we'll piggyback onto that. Normally this works, but in some special
cases it may take really long before a vmexit happens. Also, we're
penalizing each vmexit with an unneeded check. The ACK MSR is intended
to solve these issues.
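The handshake Vitaly describes (a host-side queue of completed tokens, a single shared slot, and an ACK that triggers a re-scan) can be modelled as a short userspace sketch. All names here (apf_slot, host_page_ready, guest_handle_irq) are illustrative rather than KVM's, the "IRQ" is just a counter, and tokens are assumed nonzero since 0 marks the slot empty.

```c
/* Userspace model of the single-slot 'page ready' protocol under
 * discussion: the second of two back-to-back events can only be
 * injected once the guest acknowledges the first. */
#include <assert.h>

#define QUEUE_MAX 16

static unsigned int apf_queue[QUEUE_MAX]; /* completed tokens waiting */
static int queue_head, queue_len;
static unsigned int apf_slot;             /* 0 = empty; shared with guest */
static int irqs_raised;

/* Host: runs after a worker completes and again on every guest ACK. */
static void host_check_completion(void)
{
	if (apf_slot != 0 || queue_len == 0)
		return;                   /* slot busy, or nothing pending */
	apf_slot = apf_queue[queue_head];
	queue_head = (queue_head + 1) % QUEUE_MAX;
	queue_len--;
	irqs_raised++;                    /* inject the notification IRQ */
}

static void host_page_ready(unsigned int token)
{
	apf_queue[(queue_head + queue_len) % QUEUE_MAX] = token;
	queue_len++;
	host_check_completion();
}

/* Guest interrupt handler: consume the token, free the slot, then
 * "write the ACK MSR" (modelled as a direct host call). */
static unsigned int guest_handle_irq(void)
{
	unsigned int token = apf_slot;

	apf_slot = 0;                     /* free the slot first ... */
	host_check_completion();          /* ... then wrmsr(ACK, 1) */
	return token;
}
```

Running two back-to-back completions through this model shows exactly the problem from the thread: the second IRQ is only raised once the guest's ACK lets the host refill the slot, with no polling in the vcpu_run() loop.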
Hi Vitaly,

On 5/5/20 6:16 PM, Vitaly Kuznetsov wrote:
>> I'm not sure if I understood the usage of MSR_KVM_ASYNC_PF_ACK
>> completely. It seems it's used to trap to the host, to have a chance to
>> check/deliver pending page ready events. If there are no pending
>> events, no work will be done in the trap. If that's true, it might be a
>> good idea to trap conditionally, meaning the guest writes to
>> ASYNC_PF_ACK only if there really are pending events?
>
> How does the guest know if the host has any pending events or not?

One way would be through struct kvm_vcpu_pv_apf_data, which is visible
to both host and guest. In the host, the workers that have completed
their work (GUP) are queued into vcpu->async_pf.done. So we need a bit
in struct kvm_vcpu_pv_apf_data, written by the host and read by the
guest, to see if there is pending work. I don't even think the writer
and reader need to be synchronized.

> The problem we're trying to address with the ACK MSR is the following:
> imagine the host has two 'page ready' notifications back to back. It
> puts the token for the first one in the slot and raises an IRQ, but how
> do we know when the slot becomes free so we can inject the second one?
> Currently, we have the kvm_check_async_pf_completion() check in the
> vcpu_run() loop but this doesn't guarantee timely delivery of the
> event; we just hope that there's going to be a vmexit 'some time soon'
> and we'll piggyback onto that. Normally this works, but in some special
> cases it may take really long before a vmexit happens. Also, we're
> penalizing each vmexit with an unneeded check. The ACK MSR is intended
> to solve these issues.

Thanks for the explanation. I think vmexit frequency is somewhat
proportional to the workload, meaning more vmexits would be observed if
the VM has more workload, and async page faults are part of that
workload. I agree it makes sense to proactively poke the pending ('page
ready') events, but it would be reasonable to do so conditionally, to
avoid the overhead. However, I'm not sure how much overhead it would
have on x86 :)

Thanks,
Gavin
On 05/05/20 10:51, Gavin Shan wrote:
>> How does the guest know if the host has any pending events or not?
>
> One way would be through struct kvm_vcpu_pv_apf_data, which is visible
> to both host and guest. In the host, the workers that have completed
> their work (GUP) are queued into vcpu->async_pf.done. So we need a bit
> in struct kvm_vcpu_pv_apf_data, written by the host and read by the
> guest, to see if there is pending work. I don't even think the writer
> and reader need to be synchronized.

No, this cannot work. The problem is that you need a way to tell the
host "you can give me another ready page", and a memory location is not
enough for that.

Paolo
diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index 7433e55f7184..18db3448db06 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -219,6 +219,11 @@ data:
 	If during pagefault APF reason is 0 it means that this is regular
 	page fault.
 
+	For interrupt based delivery, the guest has to write '1' to
+	MSR_KVM_ASYNC_PF_ACK every time it clears 'reason' in the shared
+	'struct kvm_vcpu_pv_apf_data'; this forces KVM to re-scan its
+	queue and deliver the next pending notification.
+
 	During delivery of type 1 APF cr2 contains a token that will
 	be used to notify a guest when missing page becomes
 	available. When page becomes available type 2 APF is sent with
@@ -340,4 +345,13 @@ data:
 
 	To switch to interrupt based delivery of type 2 APF events guests
 	are supposed to enable asynchronous page faults and set bit 3 in
-	MSR_KVM_ASYNC_PF_EN first.
+
+MSR_KVM_ASYNC_PF_ACK:
+	0x4b564d07
+
+data:
+	Asynchronous page fault acknowledgment. When the guest is done
+	processing a type 2 APF event and the 'reason' field in 'struct
+	kvm_vcpu_pv_apf_data' is cleared, it is supposed to write '1' to
+	bit 0 of the MSR; this causes the host to re-scan its queue and
+	check if there are more notifications pending.
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 1bbb0b7e062f..5c7449980619 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -51,6 +51,7 @@
 #define MSR_KVM_PV_EOI_EN      0x4b564d04
 #define MSR_KVM_POLL_CONTROL   0x4b564d05
 #define MSR_KVM_ASYNC_PF2      0x4b564d06
+#define MSR_KVM_ASYNC_PF_ACK   0x4b564d07
 
 struct kvm_steal_time {
 	__u64 steal;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 861dce1e7cf5..e3b91ac33bfd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1243,7 +1243,7 @@ static const u32 emulated_msrs_all[] = {
 	HV_X64_MSR_TSC_EMULATION_STATUS,
 
 	MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
-	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF2,
+	MSR_KVM_PV_EOI_EN, MSR_KVM_ASYNC_PF2, MSR_KVM_ASYNC_PF_ACK,
 
 	MSR_IA32_TSC_ADJUST,
 	MSR_IA32_TSCDEADLINE,
@@ -2915,6 +2915,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		if (kvm_pv_enable_async_pf2(vcpu, data))
 			return 1;
 		break;
+	case MSR_KVM_ASYNC_PF_ACK:
+		if (data & 0x1)
+			kvm_check_async_pf_completion(vcpu);
+		break;
 	case MSR_KVM_STEAL_TIME:
 
 		if (unlikely(!sched_info_on()))
@@ -3194,6 +3198,9 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_KVM_ASYNC_PF2:
 		msr_info->data = vcpu->arch.apf.msr2_val;
 		break;
+	case MSR_KVM_ASYNC_PF_ACK:
+		msr_info->data = 0;
+		break;
 	case MSR_KVM_STEAL_TIME:
 		msr_info->data = vcpu->arch.st.msr_val;
 		break;
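The host-side semantics of the new MSR case above are small enough to restate as a runnable userspace sketch. The function names and the stubbed kvm_check_async_pf_completion() are illustrative; only the "bit 0 triggers a re-scan" and "reads as zero" behaviour are taken from the patch.

```c
/* Userspace restatement of the MSR_KVM_ASYNC_PF_ACK handler from the
 * patch.  kvm_check_async_pf_completion() is stubbed with a counter;
 * in KVM it re-scans the async-PF queue and refills the token slot. */
#include <assert.h>
#include <stdint.h>

static int completions_checked;

static void kvm_check_async_pf_completion(void)
{
	completions_checked++;	/* stand-in for the real queue re-scan */
}

/* WRMSR path: only bit 0 matters; all other bits are ignored. */
static int set_msr_async_pf_ack(uint64_t data)
{
	if (data & 0x1)
		kvm_check_async_pf_completion();
	return 0;
}

/* RDMSR path: the MSR carries no readable state and always reads 0. */
static int get_msr_async_pf_ack(uint64_t *data)
{
	*data = 0;
	return 0;
}
```

This also makes Paolo's review point concrete: the handler is stateless, so nothing records whether the guest is mid-way through handling a notification.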
If two page ready notifications happen back to back the second one is not
delivered and the only mechanism we currently have is the
kvm_check_async_pf_completion() check in the vcpu_run() loop. The check
will only be performed with the next vmexit, whenever it happens, and in
some cases it may take a while. With interrupt based page ready
notification delivery the situation is even worse: unlike exceptions,
interrupts are not handled immediately so we must check if the slot is
empty. This is slow and unnecessary. Introduce a dedicated
MSR_KVM_ASYNC_PF_ACK MSR to communicate the fact that the slot is free
and the host should check its notification queue. Mandate using it for
interrupt based type 2 APF event delivery.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 Documentation/virt/kvm/msr.rst       | 16 +++++++++++++++-
 arch/x86/include/uapi/asm/kvm_para.h |  1 +
 arch/x86/kvm/x86.c                   |  9 ++++++++-
 3 files changed, 24 insertions(+), 2 deletions(-)