diff mbox series

KVM: X86: fix tlb_flush_guest()

Message ID 20210527023922.2017-1-jiangshanlai@gmail.com (mailing list archive)
State New, archived
Headers show
Series KVM: X86: fix tlb_flush_guest() | expand

Commit Message

Lai Jiangshan May 27, 2021, 2:39 a.m. UTC
From: Lai Jiangshan <laijs@linux.alibaba.com>

For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
the hypervisor to perform an operation equivalent to native_flush_tlb_global()
or invpcid_flush_all() on the specified guest CPU.

When TDP is enabled, it is sufficient to just flush the hardware
TLB of the specified guest CPU.

But when shadow paging is used, the hypervisor must sync the
shadow page tables before flushing the hardware TLB, so that
it truly emulates the operation of invpcid_flush_all() in the guest.

The problem has existed since the first implementation of KVM_VCPU_FLUSH_TLB
in commit f38a7b75267f ("KVM: X86: support paravirtualized help for TLB
shootdowns").  But I don't think it was a real-world problem at that
time, since the guest flushed the local CPU's TLB first before queuing
KVM_VCPU_FLUSH_TLB to other CPUs.  That means the hypervisor synced the
shadow page tables before seeing the corresponding KVM_VCPU_FLUSH_TLBs.

After commit 4ce94eabac16 ("x86/mm/tlb: Flush remote and local TLBs
concurrently"), the guest no longer flushes the local CPU's TLB first,
so the hypervisor can handle another vCPU's KVM_VCPU_FLUSH_TLB before the
local vCPU's TLB flush and may flush the hardware TLB without syncing
the shadow page tables beforehand.

Fixes: f38a7b75267f ("KVM: X86: support paravirtualized help for TLB shootdowns")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/svm/svm.c | 16 +++++++++++++++-
 arch/x86/kvm/vmx/vmx.c |  8 +++++++-
 2 files changed, 22 insertions(+), 2 deletions(-)
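The race in the commit message can be sketched as a small userspace toy model. None of this is KVM code; the struct and function names are invented for illustration. One vCPU's guest PTE has been changed (e.g. write-protected), but the shadow page table has not been synced yet when a remote KVM_VCPU_FLUSH_TLB arrives:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: state of one vCPU whose guest just changed a PTE. The shadow
 * page table is stale until the hypervisor syncs it.
 */
struct toy_vcpu {
	bool shadow_synced;	/* shadow page table reflects the guest PTE */
	bool hw_tlb_flushed;	/* hardware TLB flushed for this vCPU */
};

/* Pre-fix handling of KVM_VCPU_FLUSH_TLB: flush the hardware TLB only. */
void pv_flush_buggy(struct toy_vcpu *v)
{
	v->hw_tlb_flushed = true;
}

/* Fixed handling: sync the shadow page table first, then flush. */
void pv_flush_fixed(struct toy_vcpu *v)
{
	v->shadow_synced = true;	/* models kvm_mmu_sync_roots() */
	v->hw_tlb_flushed = true;
}
```

Before commit 4ce94eabac16, the initiating vCPU's local flush effectively synced the shadow page tables before the remote request was seen; afterwards, the buggy path can run first, leaving the hardware TLB flushed but the shadow PTE stale, so the guest can keep writing through a mapping it believes is read-only.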

Comments

Paolo Bonzini May 27, 2021, 12:55 p.m. UTC | #1
On 27/05/21 04:39, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
> the hypervisor do the operation that equals to native_flush_tlb_global()
> or invpcid_flush_all() in the specified guest CPU.
> 
> When TDP is enabled, there is no problem to just flush the hardware
> TLB of the specified guest CPU.
> 
> But when using shadowpaging, the hypervisor should have to sync the
> shadow pagetable at first before flushing the hardware TLB so that
> it can truely emulate the operation of invpcid_flush_all() in guest.

Can you explain why?

Also it is simpler to handle this in kvm_vcpu_flush_tlb_guest, using "if 
(tdp_enabled)".  This also provides a single, good place to add a comment 
explaining what invalid entries KVM_REQ_MMU_RELOAD is preventing.

Paolo

> The problem exists since the first implementation of KVM_VCPU_FLUSH_TLB
> in commit f38a7b75267f ("KVM: X86: support paravirtualized help for TLB
> shootdowns").  But I don't think it would be a real world problem that
> time since the local CPU's tlb is flushed at first in guest before queuing
> KVM_VCPU_FLUSH_TLB to other CPUs.  It means that the hypervisor syncs the
> shadow pagetable before seeing the corresponding KVM_VCPU_FLUSH_TLBs.
> 
> After commit 4ce94eabac16 ("x86/mm/tlb: Flush remote and local TLBs
> concurrently"), the guest doesn't flush local CPU's tlb at first and
> the hypervisor can handle other VCPU's KVM_VCPU_FLUSH_TLB earlier than
> local VCPU's tlb flush and might flush the hardware tlb without syncing
> the shadow pagetable beforehand.
> 
> Fixes: f38a7b75267f ("KVM: X86: support paravirtualized help for TLB shootdowns")
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>   arch/x86/kvm/svm/svm.c | 16 +++++++++++++++-
>   arch/x86/kvm/vmx/vmx.c |  8 +++++++-
>   2 files changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 05eca131eaf2..f4523c859245 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3575,6 +3575,20 @@ void svm_flush_tlb(struct kvm_vcpu *vcpu)
>   		svm->current_vmcb->asid_generation--;
>   }
>   
> +static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * When NPT is enabled, just flush the ASID.
> +	 *
> +	 * When NPT is not enabled, the operation should be equal to
> +	 * native_flush_tlb_global(), invpcid_flush_all() in guest.
> +	 */
> +	if (npt_enabled)
> +		svm_flush_tlb(vcpu);
> +	else
> +		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> +}
> +
>   static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
>   {
>   	struct vcpu_svm *svm = to_svm(vcpu);
> @@ -4486,7 +4500,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>   	.tlb_flush_all = svm_flush_tlb,
>   	.tlb_flush_current = svm_flush_tlb,
>   	.tlb_flush_gva = svm_flush_tlb_gva,
> -	.tlb_flush_guest = svm_flush_tlb,
> +	.tlb_flush_guest = svm_flush_tlb_guest,
>   
>   	.run = svm_vcpu_run,
>   	.handle_exit = handle_exit,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 4bceb5ca3a89..1913504e3472 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -3049,8 +3049,14 @@ static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
>   	 * are required to flush GVA->{G,H}PA mappings from the TLB if vpid is
>   	 * disabled (VM-Enter with vpid enabled and vpid==0 is disallowed),
>   	 * i.e. no explicit INVVPID is necessary.
> +	 *
> +	 * When EPT is not enabled, the operation should be equal to
> +	 * native_flush_tlb_global(), invpcid_flush_all() in guest.
>   	 */
> -	vpid_sync_context(to_vmx(vcpu)->vpid);
> +	if (enable_ept)
> +		vpid_sync_context(to_vmx(vcpu)->vpid);
> +	else
> +		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
>   }
>   
>   void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu)
>
Sean Christopherson May 27, 2021, 4:13 p.m. UTC | #2
+Maxim - A proper fix for this bug might fix your shadow paging + win10 boot
         issue; this also affects the KVM_REQ_HV_TLB_FLUSH used for HyperV PV
         flushing.

On Thu, May 27, 2021, Paolo Bonzini wrote:
> On 27/05/21 04:39, Lai Jiangshan wrote:
> > From: Lai Jiangshan <laijs@linux.alibaba.com>
> > 
> > For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
> > the hypervisor do the operation that equals to native_flush_tlb_global()
> > or invpcid_flush_all() in the specified guest CPU.
> > 
> > When TDP is enabled, there is no problem to just flush the hardware
> > TLB of the specified guest CPU.
> > 
> > But when using shadowpaging, the hypervisor should have to sync the
> > shadow pagetable at first before flushing the hardware TLB so that
> > it can truely emulate the operation of invpcid_flush_all() in guest.
> 
> Can you explain why?

KVM's unsync logic hinges on guest TLB flushes.  For page permission modifications
that require a TLB flush to take effect, e.g. making a writable page read-only,
KVM waits until the guest explicitly does said flush to propagate the changes to
the shadow page tables.  E.g. failure to sync PTEs could result in a read-only 4k
page being writable when the guest expects it to be read-only.
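The deferral Sean describes can be sketched as a toy model (again, not kernel code; names are invented for illustration). A guest PTE change only marks the shadow page unsync, and the guest's TLB flush is KVM's cue to propagate it:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one shadow PTE under KVM's unsync logic. */
struct toy_spte {
	bool writable;	/* permission currently in the shadow page table */
	bool unsync;	/* guest PTE changed, shadow not yet updated */
};

/* Guest makes a writable page read-only; KVM only marks the SP unsync. */
void guest_write_protect(struct toy_spte *spte)
{
	spte->unsync = true;		/* defer the shadow update */
}

/* A guest TLB flush tells KVM to sync the shadow page table. */
void guest_tlb_flush(struct toy_spte *spte)
{
	if (spte->unsync) {
		spte->writable = false;	/* propagate the guest's change */
		spte->unsync = false;
	}
}
```

Between guest_write_protect() and guest_tlb_flush(), the shadow PTE is stale by design; the bug is flushing the hardware TLB without ever running the sync step.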

> Also it is simpler to handle this in kvm_vcpu_flush_tlb_guest, using "if
> (tdp_enabled).  This provides also a single, good place to add a comment
> with the explanation of what invalid entries KVM_REQ_RELOAD is presenting.

Ya.  

KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
offset the performance gains of the paravirtualized flush.

And making a request won't work without revamping the order of request handling
in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
serviced before KVM_REQ_STEAL_UPDATE.

Cleaning up and documenting the MMU related requests is on my todo list, but the
immediate fix should be tiny and I can do my cleanups on top.

I believe the minimal fix is:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 81ab3b8f22e5..b0072063f9bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
 static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
 {
        ++vcpu->stat.tlb_flush;
+
+       if (!tdp_enabled)
+               kvm_mmu_sync_roots(vcpu);
        static_call(kvm_x86_tlb_flush_guest)(vcpu);
 }
Sean Christopherson May 27, 2021, 4:14 p.m. UTC | #3
+Maxim for real this time...

On Thu, May 27, 2021, Sean Christopherson wrote:
> +Maxim - A proper fix for this bug might fix your shadow paging + win10 boot
>          issue, this also affects the KVM_REQ_HV_TLB_FLUSH used for HyperV PV
> 	 flushing.
> 
> On Thu, May 27, 2021, Paolo Bonzini wrote:
> > On 27/05/21 04:39, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <laijs@linux.alibaba.com>
> > > 
> > > For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
> > > the hypervisor do the operation that equals to native_flush_tlb_global()
> > > or invpcid_flush_all() in the specified guest CPU.
> > > 
> > > When TDP is enabled, there is no problem to just flush the hardware
> > > TLB of the specified guest CPU.
> > > 
> > > But when using shadowpaging, the hypervisor should have to sync the
> > > shadow pagetable at first before flushing the hardware TLB so that
> > > it can truely emulate the operation of invpcid_flush_all() in guest.
> > 
> > Can you explain why?
> 
> KVM's unsync logic hinges on guest TLB flushes.  For page permission modifications
> that require a TLB flush to take effect, e.g. making a writable page read-only,
> KVM waits until the guest explicitly does said flush to propagate the changes to
> the shadow page tables.  E.g. failure to sync PTEs could result in a read-only 4k
> page being writable when the guest expects it to be read-only.
> 
> > Also it is simpler to handle this in kvm_vcpu_flush_tlb_guest, using "if
> > (tdp_enabled).  This provides also a single, good place to add a comment
> > with the explanation of what invalid entries KVM_REQ_RELOAD is presenting.
> 
> Ya.  
> 
> KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
> offset the performance gains of the paravirtualized flush.
> 
> And making a request won't work without revamping the order of request handling
> in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
> serviced before KVM_REQ_STEAL_UPDATE.
> 
> Cleaning up and documenting the MMU related requests is on my todo list, but the
> immediate fix should be tiny and I can do my cleanups on top.
> 
> I believe the minimal fix is:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 81ab3b8f22e5..b0072063f9bf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>  static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>  {
>         ++vcpu->stat.tlb_flush;
> +
> +       if (!tdp_enabled)
> +               kvm_mmu_sync_roots(vcpu);
>         static_call(kvm_x86_tlb_flush_guest)(vcpu);
>  }
>  
>
Sean Christopherson May 27, 2021, 7:28 p.m. UTC | #4
On Thu, May 27, 2021, Sean Christopherson wrote:
> > KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
> > offset the performance gains of the paravirtualized flush.

Argh, I take that back.  The PV KVM_VCPU_FLUSH_TLB flag doesn't distinguish
between flushing a specific mm and flushing the entire TLB.  The HyperV usage
(via KVM_REQ) also throws everything into a single bucket.  A full RELOAD still
isn't necessary as KVM just needs to sync all roots, not blast them away.  For
previous roots, KVM doesn't have a mechanism to defer the sync, so the immediate
fix will need to unload those roots.

And looking at KVM's other flows, __kvm_mmu_new_pgd() and kvm_set_cr3() are also
broken with respect to previous roots.  E.g. if the guest does a MOV CR3 that
flushes the entire TLB, followed by a MOV CR3 with PCID_NOFLUSH=1, KVM will fail
to sync the MMU on the second flush even though the guest can technically rely
on the first MOV CR3 to have synchronized any previous changes relative to the
first MOV CR3.

Lai, if it's ok with you, I'll massage this patch as discussed and fold it into
a larger series to fix the other bugs and do additional cleanup/improvements.

> > I believe the minimal fix is:
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 81ab3b8f22e5..b0072063f9bf 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
> >  static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
> >  {
> >         ++vcpu->stat.tlb_flush;
> > +
> > +       if (!tdp_enabled)
> > +               kvm_mmu_sync_roots(vcpu);
> >         static_call(kvm_x86_tlb_flush_guest)(vcpu);
> >  }
Lai Jiangshan May 28, 2021, 12:18 a.m. UTC | #5
On 2021/5/28 00:13, Sean Christopherson wrote:
> +Maxim - A proper fix for this bug might fix your shadow paging + win10 boot
>           issue, this also affects the KVM_REQ_HV_TLB_FLUSH used for HyperV PV
> 	 flushing.
> 
> On Thu, May 27, 2021, Paolo Bonzini wrote:
>> On 27/05/21 04:39, Lai Jiangshan wrote:
>>> From: Lai Jiangshan <laijs@linux.alibaba.com>
>>>
>>> For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
>>> the hypervisor do the operation that equals to native_flush_tlb_global()
>>> or invpcid_flush_all() in the specified guest CPU.
>>>
>>> When TDP is enabled, there is no problem to just flush the hardware
>>> TLB of the specified guest CPU.
>>>
>>> But when using shadowpaging, the hypervisor should have to sync the
>>> shadow pagetable at first before flushing the hardware TLB so that
>>> it can truely emulate the operation of invpcid_flush_all() in guest.
>>
>> Can you explain why?
> 
> KVM's unsync logic hinges on guest TLB flushes.  For page permission modifications
> that require a TLB flush to take effect, e.g. making a writable page read-only,
> KVM waits until the guest explicitly does said flush to propagate the changes to
> the shadow page tables.  E.g. failure to sync PTEs could result in a read-only 4k
> page being writable when the guest expects it to be read-only.
> 
>> Also it is simpler to handle this in kvm_vcpu_flush_tlb_guest, using "if
>> (tdp_enabled).  This provides also a single, good place to add a comment
>> with the explanation of what invalid entries KVM_REQ_RELOAD is presenting.
> 
> Ya.
> 
> KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
> offset the performance gains of the paravirtualized flush.
> 
> And making a request won't work without revamping the order of request handling
> in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
> serviced before KVM_REQ_STEAL_UPDATE.

Yes, it just fixes the problem in the simplest way.
I copied KVM_REQ_MMU_RELOAD from kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
(If the guest is not preempted, it will call invpcid_flush_all(), which is handled
this way.)


The improvement code will go in later, and will not be backported.
The proper way to flush the guest is to use the code in

https://lore.kernel.org/lkml/20210525213920.3340-1-jiangshanlai@gmail.com/
as:
+		kvm_mmu_sync_roots(vcpu);
+		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu); //or just call flush_current directly
+		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
+			vcpu->arch.mmu->prev_roots[i].need_sync = true;

If the need_sync patch is not accepted, we can just use kvm_mmu_sync_roots(vcpu)
to keep the current page table and use kvm_mmu_free_roots() to free all the other
roots in prev_roots.



> 
> Cleaning up and documenting the MMU related requests is on my todo list, but the
> immediate fix should be tiny and I can do my cleanups on top.
> 
> I believe the minimal fix is:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 81ab3b8f22e5..b0072063f9bf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>   static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>   {
>          ++vcpu->stat.tlb_flush;
> +
> +       if (!tdp_enabled)
> +               kvm_mmu_sync_roots(vcpu);

it doesn't handle prev_roots, which also need to be handled, as
shown in kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).

>          static_call(kvm_x86_tlb_flush_guest)(vcpu);

For tdp_enabled, I think it is better to use kvm_x86_tlb_flush_current()
to make it consistent with other shadowpage code.

>   }
>   
>
Sean Christopherson May 28, 2021, 12:26 a.m. UTC | #6
On Fri, May 28, 2021, Lai Jiangshan wrote:
> 
> On 2021/5/28 00:13, Sean Christopherson wrote:
> > And making a request won't work without revamping the order of request handling
> > in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
> > serviced before KVM_REQ_STEAL_UPDATE.
> 
> Yes, it just fixes the said problem in the simplest way.
> I copied KVM_REQ_MMU_RELOAD from kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
> (If the guest is not preempted, it will call invpcid_flush_all() and will be handled
> by this way)

The problem is that record_steal_time() is called after KVM_REQ_MMU_RELOAD
handling in vcpu_enter_guest(), so the reload request won't be recognized until
the next VM-Exit.  It works for kvm_handle_invpcid() because vcpu_enter_guest()
is guaranteed to run between the invpcid code and VM-Enter.

> The improvement code will go later, and will not be backported.

I would argue that introducing a potential performance regression is in itself a
bug.  IMO, going straight to kvm_mmu_sync_roots() is not high risk.

> The proper way to flush guest is to use code in
> 
> https://lore.kernel.org/lkml/20210525213920.3340-1-jiangshanlai@gmail.com/
> as:
> +		kvm_mmu_sync_roots(vcpu);
> +		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu); //or just call flush_current directly
> +		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> +			vcpu->arch.mmu->prev_roots[i].need_sync = true;
> 
> If need_sync patch is not accepted, we can just use kvm_mmu_sync_roots(vcpu)
> to keep the current pagetable and use kvm_mmu_free_roots() to free all the other
> roots in prev_roots.

I like the idea, I just haven't gotten around to reviewing that patch yet.

> > Cleaning up and documenting the MMU related requests is on my todo list, but the
> > immediate fix should be tiny and I can do my cleanups on top.
> > 
> > I believe the minimal fix is:
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 81ab3b8f22e5..b0072063f9bf 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
> >   static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
> >   {
> >          ++vcpu->stat.tlb_flush;
> > +
> > +       if (!tdp_enabled)
> > +               kvm_mmu_sync_roots(vcpu);
> 
> it doesn't handle prev_roots which are also needed as
> shown in kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).

Ya, I belated realized this :-)

> >          static_call(kvm_x86_tlb_flush_guest)(vcpu);
> 
> For tdp_enabled, I think it is better to use kvm_x86_tlb_flush_current()
> to make it consistent with other shadowpage code.
> 
> >   }
> >
Lai Jiangshan May 28, 2021, 1:13 a.m. UTC | #7
On 2021/5/28 03:28, Sean Christopherson wrote:
> On Thu, May 27, 2021, Sean Christopherson wrote:
>>> KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
>>> offset the performance gains of the paravirtualized flush.
> 
> Argh, I take that back.  The PV KVM_VCPU_FLUSH_TLB flag doesn't distinguish
> between flushing a specific mm and flushing the entire TLB.  The HyperV usage
> (via KVM_REQ) also throws everything into a single bucket.  A full RELOAD still
> isn't necessary as KVM just needs to sync all roots, not blast them away.  For
> previous roots, KVM doesn't have a mechanism to defer the sync, so the immediate
> fix will need to unload those roots.
> 
> And looking at KVM's other flows, __kvm_mmu_new_pgd() and kvm_set_cr3() are also
> broken with respect to previous roots.  E.g. if the guest does a MOV CR3 that
> flushes the entire TLB, followed by a MOV CR3 with PCID_NOFLUSH=1, KVM will fail
> to sync the MMU on the second flush even though the guest can technically rely
> on the first MOV CR3 to have synchronized any previous changes relative to the
> fisrt MOV CR3.

Could you elaborate on the problem, please?
When would a MOV CR3 need to flush the entire TLB if PCID is enabled?

If CR4.PCIDE = 1 and bit 63 of the instruction’s source operand is 0, the instruction invalidates all TLB entries 
associated with the PCID specified in bits 11:0 of the instruction’s source operand except those for global pages. It 
also invalidates all entries in all paging-structure caches associated with that PCID. It is not required to invalidate 
entries in the TLBs and paging-structure caches that are associated with other PCIDs.

> 
> Lai, if it's ok with you, I'll massage this patch as discussed and fold it into
> a larger series to fix the other bugs and do additional cleanup/improvements.
> 
>>> I believe the minimal fix is:
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 81ab3b8f22e5..b0072063f9bf 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>>>   static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>>>   {
>>>          ++vcpu->stat.tlb_flush;
>>> +
>>> +       if (!tdp_enabled)
>>> +               kvm_mmu_sync_roots(vcpu);
>>>          static_call(kvm_x86_tlb_flush_guest)(vcpu);
>>>   }
Lai Jiangshan May 28, 2021, 1:29 a.m. UTC | #8
On 2021/5/28 08:26, Sean Christopherson wrote:
> On Fri, May 28, 2021, Lai Jiangshan wrote:
>>
>> On 2021/5/28 00:13, Sean Christopherson wrote:
>>> And making a request won't work without revamping the order of request handling
>>> in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
>>> serviced before KVM_REQ_STEAL_UPDATE.
>>
>> Yes, it just fixes the said problem in the simplest way.
>> I copied KVM_REQ_MMU_RELOAD from kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
>> (If the guest is not preempted, it will call invpcid_flush_all() and will be handled
>> by this way)
> 
> The problem is that record_steal_time() is called after KVM_REQ_MMU_RELOAD
> in vcpu_enter_guest() and so the reload request won't be recognized until the
> next VM-Exit.  It works for kvm_handle_invpcid() because vcpu_enter_guest() is
> guaranteed to run between the invcpid code and VM-Enter.

KVM will recheck the requests before VM-enter;
see kvm_vcpu_exit_request().
It won't lose any request, it just causes an additional iteration.
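That recheck behavior can be sketched as a toy loop; this is a heavy simplification of vcpu_enter_guest() and kvm_vcpu_exit_request(), with names and flow invented for illustration:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_req {
	bool mmu_reload;	/* models a pending KVM_REQ_MMU_RELOAD */
};

/*
 * Returns the number of loop iterations until VM-enter happens. A request
 * raised after the service point (e.g. by record_steal_time()) is caught
 * by the recheck just before VM-enter, costing one extra iteration rather
 * than being lost.
 */
int enter_guest_loop(struct toy_req *req, bool raise_late)
{
	int iters = 0;

	for (;;) {
		iters++;
		req->mmu_reload = false;	/* service pending requests */
		if (raise_late) {		/* late request in same pass */
			req->mmu_reload = true;
			raise_late = false;
		}
		if (!req->mmu_reload)		/* kvm_vcpu_exit_request() */
			return iters;		/* VM-enter, nothing pending */
	}
}
```

With no late request, VM-enter happens on the first pass; with one, the loop bails before VM-enter and services the request on the second pass.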

> 
>> The improvement code will go later, and will not be backported.
> 
> I would argue that introducing a potential performance regression is in itself a
> bug.  IMO, going straight to kvm_mmu_sync_roots() is not high risk.

I think we can introduce a kvm_mmu_sync_roots_all() which syncs the current
root and handles prev_roots (marking them need_sync or dropping them) as a
cleanup/preparation patch, and then fix the problem on top of it.

Do we need to backport the cleanup/preparation patch, or just backport the
KVM_REQ_MMU_RELOAD approach?


> 
>> The proper way to flush guest is to use code in
>>
>> https://lore.kernel.org/lkml/20210525213920.3340-1-jiangshanlai@gmail.com/
>> as:
>> +		kvm_mmu_sync_roots(vcpu);
>> +		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu); //or just call flush_current directly
>> +		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>> +			vcpu->arch.mmu->prev_roots[i].need_sync = true;
>>
>> If need_sync patch is not accepted, we can just use kvm_mmu_sync_roots(vcpu)
>> to keep the current pagetable and use kvm_mmu_free_roots() to free all the other
>> roots in prev_roots.
> 
> I like the idea, I just haven't gotten around to reviewing that patch yet.
> 
>>> Cleaning up and documenting the MMU related requests is on my todo list, but the
>>> immediate fix should be tiny and I can do my cleanups on top.
>>>
>>> I believe the minimal fix is:
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 81ab3b8f22e5..b0072063f9bf 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>>>    static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>>>    {
>>>           ++vcpu->stat.tlb_flush;
>>> +
>>> +       if (!tdp_enabled)
>>> +               kvm_mmu_sync_roots(vcpu);
>>
>> it doesn't handle prev_roots which are also needed as
>> shown in kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
> 
> Ya, I belated realized this :-)
> 
>>>           static_call(kvm_x86_tlb_flush_guest)(vcpu);
>>
>> For tdp_enabled, I think it is better to use kvm_x86_tlb_flush_current()
>> to make it consistent with other shadowpage code.

For !tdp_enabled, I think it is better to use kvm_x86_tlb_flush_current()
to make it consistent with other shadowpage code.

>>
>>>    }
>>>
Maxim Levitsky May 29, 2021, 10:12 p.m. UTC | #9
On Thu, 2021-05-27 at 16:13 +0000, Sean Christopherson wrote:
> +Maxim - A proper fix for this bug might fix your shadow paging + win10 boot
>          issue, this also affects the KVM_REQ_HV_TLB_FLUSH used for HyperV PV
> 	 flushing.

Still crashes with this patch, sadly (tested now on my AMD laptop).

This win10 boot bug without TDP paging is 100% reproducible, although it
crashes a bit differently each time: sometimes the VM reboots right at the
start of boot, sometimes it just hangs forever, and sometimes it bluescreens
(with various reasons).
This makes sense for random paging-related corruption, though.
 
I'll look at this patch more carefully soon.
 
I also have another bug, which I put aside due to its complexity; it
involves running hyperv itself nested and then migrating the L1, which
leads to the hyperv VM bluescreening sooner or later.
(I don't remember anymore if that was on both Intel and AMD or only Intel,
but it didn't involve any npt/ept disablement.)
So I'll see if this patch helps with that bug as well.

Thanks,
	Best regards,
		Maxim Levitsky


> 
> On Thu, May 27, 2021, Paolo Bonzini wrote:
> > On 27/05/21 04:39, Lai Jiangshan wrote:
> > > From: Lai Jiangshan <laijs@linux.alibaba.com>
> > > 
> > > For KVM_VCPU_FLUSH_TLB used in kvm_flush_tlb_multi(), the guest expects
> > > the hypervisor do the operation that equals to native_flush_tlb_global()
> > > or invpcid_flush_all() in the specified guest CPU.
> > > 
> > > When TDP is enabled, there is no problem to just flush the hardware
> > > TLB of the specified guest CPU.
> > > 
> > > But when using shadowpaging, the hypervisor should have to sync the
> > > shadow pagetable at first before flushing the hardware TLB so that
> > > it can truely emulate the operation of invpcid_flush_all() in guest.
> > 
> > Can you explain why?
> 
> KVM's unsync logic hinges on guest TLB flushes.  For page permission modifications
> that require a TLB flush to take effect, e.g. making a writable page read-only,
> KVM waits until the guest explicitly does said flush to propagate the changes to
> the shadow page tables.  E.g. failure to sync PTEs could result in a read-only 4k
> page being writable when the guest expects it to be read-only.
> 
> > Also it is simpler to handle this in kvm_vcpu_flush_tlb_guest, using "if
> > (tdp_enabled).  This provides also a single, good place to add a comment
> > with the explanation of what invalid entries KVM_REQ_RELOAD is presenting.
> 
> Ya.  
> 
> KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
> offset the performance gains of the paravirtualized flush.
> 
> And making a request won't work without revamping the order of request handling
> in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
> serviced before KVM_REQ_STEAL_UPDATE.
> 
> Cleaning up and documenting the MMU related requests is on my todo list, but the
> immediate fix should be tiny and I can do my cleanups on top.
> 
> I believe the minimal fix is:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 81ab3b8f22e5..b0072063f9bf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>  static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>  {
>         ++vcpu->stat.tlb_flush;
> +
> +       if (!tdp_enabled)
> +               kvm_mmu_sync_roots(vcpu);
>         static_call(kvm_x86_tlb_flush_guest)(vcpu);
>  }
>  
>
Lai Jiangshan June 2, 2021, 8:13 a.m. UTC | #10
On 2021/5/28 08:26, Sean Christopherson wrote:
> On Fri, May 28, 2021, Lai Jiangshan wrote:
>>
>> On 2021/5/28 00:13, Sean Christopherson wrote:
>>> And making a request won't work without revamping the order of request handling
>>> in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
>>> serviced before KVM_REQ_STEAL_UPDATE.
>>
>> Yes, it just fixes the said problem in the simplest way.
>> I copied KVM_REQ_MMU_RELOAD from kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
>> (If the guest is not preempted, it will call invpcid_flush_all() and will be handled
>> by this way)
> 
> The problem is that record_steal_time() is called after KVM_REQ_MMU_RELOAD
> in vcpu_enter_guest() and so the reload request won't be recognized until the
> next VM-Exit.  It works for kvm_handle_invpcid() because vcpu_enter_guest() is
> guaranteed to run between the invcpid code and VM-Enter.
> 
>> The improvement code will go later, and will not be backported.
> 
> I would argue that introducing a potential performance regression is in itself a
> bug.  IMO, going straight to kvm_mmu_sync_roots() is not high risk.

Hello, Sean

Patch V2 addresses all these concerns, and it uses the minimal fix you
suggested in your previous reply (fixing it directly in
kvm_vcpu_flush_tlb_guest()).

Could you review it again, please?

Thanks
Lai.

> 
>> The proper way to flush guest is to use code in
>>
>> https://lore.kernel.org/lkml/20210525213920.3340-1-jiangshanlai@gmail.com/
>> as:
>> +		kvm_mmu_sync_roots(vcpu);
>> +		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu); //or just call flush_current directly
>> +		for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>> +			vcpu->arch.mmu->prev_roots[i].need_sync = true;
>>
>> If need_sync patch is not accepted, we can just use kvm_mmu_sync_roots(vcpu)
>> to keep the current pagetable and use kvm_mmu_free_roots() to free all the other
>> roots in prev_roots.
> 
> I like the idea, I just haven't gotten around to reviewing that patch yet.
> 
>>> Cleaning up and documenting the MMU related requests is on my todo list, but the
>>> immediate fix should be tiny and I can do my cleanups on top.
>>>
>>> I believe the minimal fix is:
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 81ab3b8f22e5..b0072063f9bf 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -3072,6 +3072,9 @@ static void kvm_vcpu_flush_tlb_all(struct kvm_vcpu *vcpu)
>>>    static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
>>>    {
>>>           ++vcpu->stat.tlb_flush;
>>> +
>>> +       if (!tdp_enabled)
>>> +               kvm_mmu_sync_roots(vcpu);
>>
>> it doesn't handle prev_roots which are also needed as
>> shown in kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
> 
> Ya, I belated realized this :-)
> 
>>>           static_call(kvm_x86_tlb_flush_guest)(vcpu);
>>
>> For tdp_enabled, I think it is better to use kvm_x86_tlb_flush_current()
>> to make it consistent with other shadowpage code.
>>
>>>    }
>>>
Sean Christopherson June 2, 2021, 3:01 p.m. UTC | #11
On Fri, May 28, 2021, Lai Jiangshan wrote:
> 
> On 2021/5/28 08:26, Sean Christopherson wrote:
> > On Fri, May 28, 2021, Lai Jiangshan wrote:
> > > 
> > > On 2021/5/28 00:13, Sean Christopherson wrote:
> > > > And making a request won't work without revamping the order of request handling
> > > > in vcpu_enter_guest(), e.g. KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_SYNC are both
> > > > serviced before KVM_REQ_STEAL_UPDATE.
> > > 
> > > Yes, it just fixes the said problem in the simplest way.
> > > I copied KVM_REQ_MMU_RELOAD from kvm_handle_invpcid(INVPCID_TYPE_ALL_INCL_GLOBAL).
> > > (If the guest is not preempted, it will call invpcid_flush_all() and will be handled
> > > by this way)
> > 
> > The problem is that record_steal_time() is called after KVM_REQ_MMU_RELOAD
> > in vcpu_enter_guest() and so the reload request won't be recognized until the
> > next VM-Exit.  It works for kvm_handle_invpcid() because vcpu_enter_guest() is
> > guaranteed to run between the invcpid code and VM-Enter.
> 
> KVM will recheck the requests before VM-enter.
> See kvm_vcpu_exit_request().

Ah, right, forgot requests are rechecked.  Thanks!
Sean Christopherson June 2, 2021, 3:09 p.m. UTC | #12
On Fri, May 28, 2021, Lai Jiangshan wrote:
> 
> 
> On 2021/5/28 03:28, Sean Christopherson wrote:
> > On Thu, May 27, 2021, Sean Christopherson wrote:
> > > > KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
> > > > offset the performance gains of the paravirtualized flush.
> > 
> > Argh, I take that back.  The PV KVM_VCPU_FLUSH_TLB flag doesn't distinguish
> > between flushing a specific mm and flushing the entire TLB.  The HyperV usage
> > (via KVM_REQ) also throws everything into a single bucket.  A full RELOAD still
> > isn't necessary as KVM just needs to sync all roots, not blast them away.  For
> > previous roots, KVM doesn't have a mechanism to defer the sync, so the immediate
> > fix will need to unload those roots.
> > 
> > And looking at KVM's other flows, __kvm_mmu_new_pgd() and kvm_set_cr3() are also
> > broken with respect to previous roots.  E.g. if the guest does a MOV CR3 that
> > flushes the entire TLB, followed by a MOV CR3 with PCID_NOFLUSH=1, KVM will fail
> > to sync the MMU on the second flush even though the guest can technically rely
> > on the first MOV CR3 to have synchronized any previous changes relative to the
> > first MOV CR3.
> 
> Could you elaborate the problem please?
> When would a MOV CR3 need to flush the entire TLB if PCID is enabled?

Scratch that, I was wrong.  The SDM explicitly states that other PCIDs don't
need to be flushed if CR4.PCIDE=1.
Sean Christopherson June 2, 2021, 10:07 p.m. UTC | #13
On Wed, Jun 02, 2021, Sean Christopherson wrote:
> On Fri, May 28, 2021, Lai Jiangshan wrote:
> > 
> > 
> > On 2021/5/28 03:28, Sean Christopherson wrote:
> > > On Thu, May 27, 2021, Sean Christopherson wrote:
> > > > > KVM_REQ_MMU_RELOAD is overkill, nuking the shadow page tables will completely
> > > > > offset the performance gains of the paravirtualized flush.
> > > 
> > > Argh, I take that back.  The PV KVM_VCPU_FLUSH_TLB flag doesn't distinguish
> > > between flushing a specific mm and flushing the entire TLB.  The HyperV usage
> > > (via KVM_REQ) also throws everything into a single bucket.  A full RELOAD still
> > > isn't necessary as KVM just needs to sync all roots, not blast them away.  For
> > > previous roots, KVM doesn't have a mechanism to defer the sync, so the immediate
> > > fix will need to unload those roots.
> > > 
> > > And looking at KVM's other flows, __kvm_mmu_new_pgd() and kvm_set_cr3() are also
> > > broken with respect to previous roots.  E.g. if the guest does a MOV CR3 that
> > > flushes the entire TLB, followed by a MOV CR3 with PCID_NOFLUSH=1, KVM will fail
> > > to sync the MMU on the second flush even though the guest can technically rely
> > > on the first MOV CR3 to have synchronized any previous changes relative to the
> > > first MOV CR3.
> > 
> > Could you elaborate the problem please?
> > When would a MOV CR3 need to flush the entire TLB if PCID is enabled?
> 
> Scratch that, I was wrong.  The SDM explicitly states that other PCIDs don't
> need to be flushed if CR4.PCIDE=1.

*sigh*

I was partially right.  If the guest does

  1: MOV    B, %rax
     MOV %rax, %cr3

  2: <modify PTEs in B>

  3: MOV    A, %rax
     MOV %rax, %cr3
 
  4: MOV    B, %rax
     BTS  $63, %rax
     MOV %rax, %cr3

where A and B are CR3 values with the same PCID, then KVM will fail to sync B at
step (4) due to PCID_NOFLUSH, even though the guest can technically rely on
its modifications at step (2) to become visible at step (3) when the PCID is
flushed on CR3 load.

So it's not a full TLB flush, but rather a flush of the PCID, which can theoretically
impact previous CR3 values.

Patch

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 05eca131eaf2..f4523c859245 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3575,6 +3575,20 @@  void svm_flush_tlb(struct kvm_vcpu *vcpu)
 		svm->current_vmcb->asid_generation--;
 }
 
+static void svm_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * When NPT is enabled, just flush the ASID.
+	 *
+	 * When NPT is not enabled, the operation must be equivalent to
+	 * native_flush_tlb_global() or invpcid_flush_all() in the guest.
+	 */
+	if (npt_enabled)
+		svm_flush_tlb(vcpu);
+	else
+		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+}
+
 static void svm_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t gva)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -4486,7 +4500,7 @@  static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.tlb_flush_all = svm_flush_tlb,
 	.tlb_flush_current = svm_flush_tlb,
 	.tlb_flush_gva = svm_flush_tlb_gva,
-	.tlb_flush_guest = svm_flush_tlb,
+	.tlb_flush_guest = svm_flush_tlb_guest,
 
 	.run = svm_vcpu_run,
 	.handle_exit = handle_exit,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4bceb5ca3a89..1913504e3472 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3049,8 +3049,14 @@  static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
 	 * are required to flush GVA->{G,H}PA mappings from the TLB if vpid is
 	 * disabled (VM-Enter with vpid enabled and vpid==0 is disallowed),
 	 * i.e. no explicit INVVPID is necessary.
+	 *
+	 * When EPT is not enabled, the operation must be equivalent to
+	 * native_flush_tlb_global() or invpcid_flush_all() in the guest.
 	 */
-	vpid_sync_context(to_vmx(vcpu)->vpid);
+	if (enable_ept)
+		vpid_sync_context(to_vmx(vcpu)->vpid);
+	else
+		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
 }
 
 void vmx_ept_load_pdptrs(struct kvm_vcpu *vcpu)