Message ID | 20211019110154.4091-2-jiangshanlai@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Series | KVM: X86: Improve guest TLB flushing |
On Tue, Oct 19, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
>
> The KVM doesn't know whether any TLB for a specific pcid is cached in
> the CPU when tdp is enabled. So it is better to flush all the guest
> TLB when invalidating any single PCID context.
>
> The case is rare or even impossible since KVM doesn't intercept CR3
> write or INVPCID instructions when tdp is enabled. The fix is just
> for the sake of robustness in case emulation can reach here or the
> interception policy is changed.
>
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/x86.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c59b63c56af9..06169ed08db0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1073,6 +1073,16 @@ static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid)
>  	unsigned long roots_to_free = 0;
>  	int i;
>
> +	/*
> +	 * It is very unlikely to reach here when tdp_enabled. But if it is
> +	 * the case, the kvm doesn't know whether any TLB for the @pcid is
> +	 * cached in the CPU. So just flush the guest instead.
> +	 */
> +	if (unlikely(tdp_enabled)) {

This is reachable on VMX if EPT=1, unrestricted_guest=0, and CR0.PG=0. In that
case, KVM is running the guest with the KVM-defined identity mapped CR3 / page
tables and intercepts MOV CR3 so that the guest can't overwrite the "real" CR3,
and so that the guest sees its last written CR3 on read.

This is also reachable from the emulator if the guest manipulates a vCPU code
stream so that KVM sees a MOV CR3 after a legitimate emulation trigger.

However, in both cases the KVM_REQ_TLB_FLUSH_GUEST is unnecessary. In the first
case, paging is disabled so there are no TLB entries from the guest's
perspective. In the second, the guest is malicious/broken and gets to keep the
pieces.

That said, I agree a sanity check is worthwhile, though with a reworded comment
to call out the known scenarios and that the TDP page tables are not affected by
the invalidation. Maybe this?

	/*
	 * MOV CR3 and INVPCID are usually not intercepted when using TDP, but
	 * this is reachable when running EPT=1 and unrestricted_guest=0, and
	 * also via the emulator. KVM's TDP page tables are not in the scope of
	 * the invalidation, but the guest's TLB entries need to be flushed as
	 * the CPU may have cached entries in its TLB for the target PCID.
	 */

> +		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> +		return;
> +	}
> +
>  	/*
>  	 * If neither the current CR3 nor any of the prev_roots use the given
>  	 * PCID, then nothing needs to be done here because a resync will
> --
> 2.19.1.6.gb485710b
>
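The early-out discussed in this message can be modelled outside the kernel. What follows is a minimal user-space sketch, not KVM code: `toy_vcpu`, `toy_invalidate_pcid()` and the `TOY_REQ_*` bits are hypothetical stand-ins for `struct kvm_vcpu`, `kvm_invalidate_pcid()` and KVM's request bits, and the shadow-paging branch is collapsed into a single placeholder request because the thread does not show that code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical request bits; KVM's real request machinery is not modelled. */
#define TOY_REQ_TLB_FLUSH_GUEST (1u << 0)
#define TOY_REQ_SHADOW_RESYNC   (1u << 1)  /* placeholder for the non-TDP work */

/* Hypothetical stand-in for the few pieces of vCPU state the check needs. */
struct toy_vcpu {
	bool     tdp_enabled;  /* models the tdp_enabled flag */
	uint64_t active_pcid;  /* PCID of the guest's current CR3 */
	uint32_t requests;     /* pending request bits */
};

static void toy_invalidate_pcid(struct toy_vcpu *vcpu, uint64_t pcid)
{
	/*
	 * With TDP the hypervisor cannot tell which PCIDs the CPU has cached
	 * entries for, so the sketch requests a full guest TLB flush.
	 */
	if (vcpu->tdp_enabled) {
		vcpu->requests |= TOY_REQ_TLB_FLUSH_GUEST;
		return;
	}

	/* Assumed simplification: only the active PCID triggers any work. */
	if (vcpu->active_pcid == pcid)
		vcpu->requests |= TOY_REQ_SHADOW_RESYNC;
}
```

In this model a TDP vCPU always ends up with the guest-flush request regardless of the PCID passed in, which is the conservative behaviour the patch aims for.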
On 2021/10/19 23:25, Sean Christopherson wrote:
>
> 	/*
> 	 * MOV CR3 and INVPCID are usually not intercepted when using TDP, but
> 	 * this is reachable when running EPT=1 and unrestricted_guest=0, and
> 	 * also via the emulator. KVM's TDP page tables are not in the scope of
> 	 * the invalidation, but the guest's TLB entries need to be flushed as
> 	 * the CPU may have cached entries in its TLB for the target PCID.
> 	 */

Thanks! It is a better description.

I just read some of the interception policy in vmx.c: if EPT=1 but
vmx_need_pf_intercept() returns true for some reason/config, #PF is
intercepted. But CR3 writes are not intercepted, which means there will be an
EPT fault _after_ (IIUC) the CR3 write if the GPA of the new CR3 exceeds the
guest maxphyaddr limit. And kvm queues a fault to the guest which is also
_after_ the CR3 write, but the guest expects the fault before the write.

IIUC, it can be fixed by intercepting CR3 writes or by reversing the CR3 write
in the EPT violation handler.

Thanks
Lai.
On Wed, Oct 20, 2021, Lai Jiangshan wrote:
> On 2021/10/19 23:25, Sean Christopherson wrote:
> I just read some interception policy in vmx.c, if EPT=1 but vmx_need_pf_intercept()
> return true for some reasons/configs, #PF is intercepted. But CR3 write is not
> intercepted, which means there will be an EPT fault _after_ (IIUC) the CR3 write if
> the GPA of the new CR3 exceeds the guest maxphyaddr limit. And kvm queues a fault to
> the guest which is also _after_ the CR3 write, but the guest expects the fault before
> the write.
>
> IIUC, it can be fixed by intercepting CR3 write or reversing the CR3 write in EPT
> violation handler.

KVM implicitly does the latter by emulating the faulting instruction.

  static int handle_ept_violation(struct kvm_vcpu *vcpu)
  {
	...

	/*
	 * Check that the GPA doesn't exceed physical memory limits, as that is
	 * a guest page fault. We have to emulate the instruction here, because
	 * if the illegal address is that of a paging structure, then
	 * EPT_VIOLATION_ACC_WRITE bit is set. Alternatively, if supported we
	 * would also use advanced VM-exit information for EPT violations to
	 * reconstruct the page fault error code.
	 */
	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
		return kvm_emulate_instruction(vcpu, 0);

	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
  }

and injecting a #GP when kvm_set_cr3() fails.

  static int em_cr_write(struct x86_emulate_ctxt *ctxt)
  {
	if (ctxt->ops->set_cr(ctxt, ctxt->modrm_reg, ctxt->src.val))
		return emulate_gp(ctxt, 0);

	/* Disable writeback. */
	ctxt->dst.type = OP_NONE;
	return X86EMUL_CONTINUE;
  }
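The "illegal GPA" test quoted in handle_ept_violation() amounts to asking whether an address has any bit set at or above the guest's MAXPHYADDR. A minimal stand-alone sketch of that idea follows; `toy_gpa_is_illegal()` is a hypothetical helper written for illustration, and it is only assumed to behave like `kvm_vcpu_is_illegal_gpa()`, whose body is not shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if @gpa has any bit set at or above
 * bit position @maxphyaddr, i.e. the address is wider than the guest's
 * advertised physical address width.
 */
static bool toy_gpa_is_illegal(uint64_t gpa, unsigned int maxphyaddr)
{
	/* Shifting a 64-bit value by 64 or more is undefined in C. */
	if (maxphyaddr >= 64)
		return false;

	return (gpa >> maxphyaddr) != 0;
}
```

For example, with a guest MAXPHYADDR of 40, an address with bit 40 set is rejected while the highest 40-bit address is accepted.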
On 2021/10/21 02:26, Sean Christopherson wrote:
> On Wed, Oct 20, 2021, Lai Jiangshan wrote:
>> On 2021/10/19 23:25, Sean Christopherson wrote:
>> I just read some interception policy in vmx.c, if EPT=1 but vmx_need_pf_intercept()
>> return true for some reasons/configs, #PF is intercepted. But CR3 write is not
>> intercepted, which means there will be an EPT fault _after_ (IIUC) the CR3 write if
>> the GPA of the new CR3 exceeds the guest maxphyaddr limit. And kvm queues a fault to
>> the guest which is also _after_ the CR3 write, but the guest expects the fault before
>> the write.
>>
>> IIUC, it can be fixed by intercepting CR3 write or reversing the CR3 write in EPT
>> violation handler.
>
> KVM implicitly does the latter by emulating the faulting instruction.
>
>   static int handle_ept_violation(struct kvm_vcpu *vcpu)
>   {
> 	...
>
> 	/*
> 	 * Check that the GPA doesn't exceed physical memory limits, as that is
> 	 * a guest page fault. We have to emulate the instruction here, because
> 	 * if the illegal address is that of a paging structure, then
> 	 * EPT_VIOLATION_ACC_WRITE bit is set. Alternatively, if supported we
> 	 * would also use advanced VM-exit information for EPT violations to
> 	 * reconstruct the page fault error code.
> 	 */
> 	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
> 		return kvm_emulate_instruction(vcpu, 0);
>
> 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>   }
>
> and injecting a #GP when kvm_set_cr3() fails.

I think the EPT violation happens *after* the cr3 write. So the instruction to
be emulated is not "cr3 write". The emulation will queue a fault into the guest
though, and a recursive EPT violation happens since the cr3 exceeds the
maxphyaddr limit.

In this case, the guest is malicious/broken and gets to keep the pieces too.

>
>   static int em_cr_write(struct x86_emulate_ctxt *ctxt)
>   {
> 	if (ctxt->ops->set_cr(ctxt, ctxt->modrm_reg, ctxt->src.val))
> 		return emulate_gp(ctxt, 0);
>
> 	/* Disable writeback. */
> 	ctxt->dst.type = OP_NONE;
> 	return X86EMUL_CONTINUE;
>   }
>
On Thu, Oct 21, 2021, Lai Jiangshan wrote:
>
>
> On 2021/10/21 02:26, Sean Christopherson wrote:
> > On Wed, Oct 20, 2021, Lai Jiangshan wrote:
> > > On 2021/10/19 23:25, Sean Christopherson wrote:
> > > I just read some interception policy in vmx.c, if EPT=1 but vmx_need_pf_intercept()
> > > return true for some reasons/configs, #PF is intercepted. But CR3 write is not
> > > intercepted, which means there will be an EPT fault _after_ (IIUC) the CR3 write if
> > > the GPA of the new CR3 exceeds the guest maxphyaddr limit. And kvm queues a fault to
> > > the guest which is also _after_ the CR3 write, but the guest expects the fault before
> > > the write.
> > >
> > > IIUC, it can be fixed by intercepting CR3 write or reversing the CR3 write in EPT
> > > violation handler.
> >
> > KVM implicitly does the latter by emulating the faulting instruction.
> >
> >   static int handle_ept_violation(struct kvm_vcpu *vcpu)
> >   {
> > 	...
> >
> > 	/*
> > 	 * Check that the GPA doesn't exceed physical memory limits, as that is
> > 	 * a guest page fault. We have to emulate the instruction here, because
> > 	 * if the illegal address is that of a paging structure, then
> > 	 * EPT_VIOLATION_ACC_WRITE bit is set. Alternatively, if supported we
> > 	 * would also use advanced VM-exit information for EPT violations to
> > 	 * reconstruct the page fault error code.
> > 	 */
> > 	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
> > 		return kvm_emulate_instruction(vcpu, 0);
> >
> > 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> >   }
> >
> > and injecting a #GP when kvm_set_cr3() fails.
>
> I think the EPT violation happens *after* the cr3 write. So the instruction to be
> emulated is not "cr3 write". The emulation will queue fault into guest though,
> recursive EPT violation happens since the cr3 exceeds maxphyaddr limit.

Doh, you're correct. I think my mind wandered into thinking about what would
happen with PDPTRs and forgot to get back to normal MOV CR3.

So yeah, the only way to correctly handle this would be to intercept CR3 loads.
I'm guessing that would have a noticeable impact on guest performance.

Paolo, I'll leave this one for you to decide, we have pretty much written off
allow_smaller_maxphyaddr :-)
On 21/10/21 16:52, Sean Christopherson wrote:
>> I think the EPT violation happens *after* the cr3 write. So the instruction to be
>> emulated is not "cr3 write". The emulation will queue fault into guest though,
>> recursive EPT violation happens since the cr3 exceeds maxphyaddr limit.
> Doh, you're correct. I think my mind wandered into thinking about what would
> happen with PDPTRs and forgot to get back to normal MOV CR3.
>
> So yeah, the only way to correctly handle this would be to intercept CR3 loads.
> I'm guessing that would have a noticeable impact on guest performance.

Ouch... yeah, allow_smaller_maxphyaddr already has bad performance, but
intercepting CR3 loads would be another kind of slow.

Paolo

> Paolo, I'll leave this one for you to decide, we have pretty much written off
> allow_smaller_maxphyaddr :-)
On Thu, Oct 21, 2021 at 10:13 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 21/10/21 16:52, Sean Christopherson wrote:
> >> I think the EPT violation happens *after* the cr3 write. So the instruction to be
> >> emulated is not "cr3 write". The emulation will queue fault into guest though,
> >> recursive EPT violation happens since the cr3 exceeds maxphyaddr limit.
> > Doh, you're correct. I think my mind wandered into thinking about what would
> > happen with PDPTRs and forgot to get back to normal MOV CR3.
> >
> > So yeah, the only way to correctly handle this would be to intercept CR3 loads.
> > I'm guessing that would have a noticeable impact on guest performance.
>
> Ouch... yeah, allow_smaller_maxphyaddr already has bad performance, but
> intercepting CR3 loads would be another kind of slow.

Can we kill it? It's only half-baked as it is. Or are we committed to it now?
On 2021/10/21 22:52, Sean Christopherson wrote:
> On Thu, Oct 21, 2021, Lai Jiangshan wrote:
>>
>>
>> On 2021/10/21 02:26, Sean Christopherson wrote:
>>> On Wed, Oct 20, 2021, Lai Jiangshan wrote:
>>>> On 2021/10/19 23:25, Sean Christopherson wrote:
>>>> I just read some interception policy in vmx.c, if EPT=1 but vmx_need_pf_intercept()
>>>> return true for some reasons/configs, #PF is intercepted. But CR3 write is not
>>>> intercepted, which means there will be an EPT fault _after_ (IIUC) the CR3 write if
>>>> the GPA of the new CR3 exceeds the guest maxphyaddr limit. And kvm queues a fault to
>>>> the guest which is also _after_ the CR3 write, but the guest expects the fault before
>>>> the write.
>>>>
>>>> IIUC, it can be fixed by intercepting CR3 write or reversing the CR3 write in EPT
>>>> violation handler.
>>>
>>> KVM implicitly does the latter by emulating the faulting instruction.
>>>
>>>   static int handle_ept_violation(struct kvm_vcpu *vcpu)
>>>   {
>>> 	...
>>>
>>> 	/*
>>> 	 * Check that the GPA doesn't exceed physical memory limits, as that is
>>> 	 * a guest page fault. We have to emulate the instruction here, because
>>> 	 * if the illegal address is that of a paging structure, then
>>> 	 * EPT_VIOLATION_ACC_WRITE bit is set. Alternatively, if supported we
>>> 	 * would also use advanced VM-exit information for EPT violations to
>>> 	 * reconstruct the page fault error code.
>>> 	 */
>>> 	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
>>> 		return kvm_emulate_instruction(vcpu, 0);
>>>
>>> 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>>>   }
>>>
>>> and injecting a #GP when kvm_set_cr3() fails.
>>
>> I think the EPT violation happens *after* the cr3 write. So the instruction to be
>> emulated is not "cr3 write". The emulation will queue fault into guest though,
>> recursive EPT violation happens since the cr3 exceeds maxphyaddr limit.
>
> Doh, you're correct. I think my mind wandered into thinking about what would
> happen with PDPTRs and forgot to get back to normal MOV CR3.
>
> So yeah, the only way to correctly handle this would be to intercept CR3 loads.
> I'm guessing that would have a noticeable impact on guest performance.

I think we can detect it in handle_ept_violation() via checking the cr3 value,
and make it triple-fault if it is the case, so that the VMM can exit.

I don't think any OS would use the reserved bit in CR3 and the corresponding #GP.

>
> Paolo, I'll leave this one for you to decide, we have pretty much written off
> allow_smaller_maxphyaddr :-)
>
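One possible reading of the suggestion above, sketched as a stand-alone decision function rather than as a KVM patch. Everything here is hypothetical: `sketch_ept_violation()`, `sketch_exceeds()` and the `SKETCH_*` outcomes do not exist in the kernel, the `allow_smaller_maxphyaddr` condition is assumed to be on, and the CR3 base address is approximated by clearing the low 12 bits, which is a simplification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcomes of the modelled EPT violation handler. */
enum sketch_action {
	SKETCH_HANDLE_FAULT,  /* ordinary MMU page fault handling */
	SKETCH_EMULATE,       /* faulting GPA is illegal: emulate the instruction */
	SKETCH_TRIPLE_FAULT,  /* guest CR3 itself is illegal: give up on the guest */
};

/* True if @addr has any bit set at or above bit @maxphyaddr. */
static bool sketch_exceeds(uint64_t addr, unsigned int maxphyaddr)
{
	return maxphyaddr < 64 && (addr >> maxphyaddr) != 0;
}

static enum sketch_action sketch_ept_violation(uint64_t guest_cr3,
					       uint64_t gpa,
					       unsigned int maxphyaddr)
{
	/* Assumed simplification: drop the low 12 bits (PCID / flag bits). */
	uint64_t cr3_base = guest_cr3 & ~UINT64_C(0xfff);

	/*
	 * The CR3 write has already completed by the time the violation is
	 * seen, so a #GP can no longer be delivered at the right point.
	 */
	if (sketch_exceeds(cr3_base, maxphyaddr))
		return SKETCH_TRIPLE_FAULT;

	if (sketch_exceeds(gpa, maxphyaddr))
		return SKETCH_EMULATE;

	return SKETCH_HANDLE_FAULT;
}
```

The CR3 check is ordered first because, as noted earlier in the thread, emulating the faulting instruction does not help when the illegal address came from CR3 itself.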
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c59b63c56af9..06169ed08db0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1073,6 +1073,16 @@ static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid)
 	unsigned long roots_to_free = 0;
 	int i;
 
+	/*
+	 * It is very unlikely to reach here when tdp_enabled. But if it is
+	 * the case, the kvm doesn't know whether any TLB for the @pcid is
+	 * cached in the CPU. So just flush the guest instead.
+	 */
+	if (unlikely(tdp_enabled)) {
+		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
+		return;
+	}
+
 	/*
 	 * If neither the current CR3 nor any of the prev_roots use the given
 	 * PCID, then nothing needs to be done here because a resync will