Message ID | 20240815123349.729017-2-mlevitsk@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Relax canonical checks on some arch msrs | expand |
On Thu, Aug 15, 2024, Maxim Levitsky wrote: > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index ce7c00894f32..2e83f7d74591 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -302,6 +302,31 @@ const struct kvm_stats_header kvm_vcpu_stats_header = { > sizeof(kvm_vcpu_stats_desc), > }; > > + > +/* > + * Most x86 arch MSR values which contain linear addresses like Is it most, or all? I'm guessing all? > + * segment bases, addresses that are used in instructions (e.g SYSENTER), > + * have static canonicality checks, Weird and early line breaks. How about this? /* * The canonicality checks for MSRs that hold linear addresses, e.g. segment * bases, SYSENTER targets, etc., are static, in the sense that they are based * on CPU _support_ for 5-level paging, not the state of CR4.LA57. > + * size of whose depends only on CPU's support for 5-level > + * paging, rather than state of CR4.LA57. > + * > + * In addition to that, some of these MSRS are directly passed > + * to the guest (e.g MSR_KERNEL_GS_BASE) thus even if the guest > + * doen't have LA57 enabled in its CPUID, for consistency with > + * CPUs' ucode, it is better to pivot the check around host > + * support for 5 level paging. I think we should elaborate on why it's better. It only takes another line or two, and that way we don't forget the edge cases that make properly emulating guest CPUID a bad idea. * This creates a virtualization hole where a guest writes to passthrough MSRs * may incorrectly succeed if the CPU supports LA57, but the vCPU does not * (because hardware has no awareness of guest CPUID). Do not try to plug this * hole, i.e. emulate the behavior for intercepted accesses, as injecting #GP * depending on whether or not KVM happens to emulate a WRMSR would result in * non-deterministic behavior, and could even allow L2 to crash L1, e.g. if L1 * passes through an MSR to L2, and then tries to save+restore L2's value. */ > + > +static u8 max_host_supported_virt_addr_bits(void) Any objection to dropping the "supported", i.e. going with max_host_virt_addr_bits()? Mostly to shorten the name, but also because "supported" suggests there's software involvement, e.g. the max supported by the kernel/KVM, which isn't the case. If you're ok with the above, I'll fixup when applying.
On Fri, Aug 16, 2024, Sean Christopherson wrote: > On Thu, Aug 15, 2024, Maxim Levitsky wrote: > How about this? > > /* > * The canonicality checks for MSRs that hold linear addresses, e.g. segment > * bases, SYSENTER targets, etc., are static, in the sense that they are based > * on CPU _support_ for 5-level paging, not the state of CR4.LA57. > > > + * size of whose depends only on CPU's support for 5-level > > + * paging, rather than state of CR4.LA57. > > + * > > + * In addition to that, some of these MSRS are directly passed > > + * to the guest (e.g MSR_KERNEL_GS_BASE) thus even if the guest > > + * doen't have LA57 enabled in its CPUID, for consistency with > > + * CPUs' ucode, it is better to pivot the check around host > > + * support for 5 level paging. > > I think we should elaborate on why it's better. It only takes another line or > two, and that way we don't forget the edge cases that make properly emulating > guest CPUID a bad idea. > > * This creates a virtualization hole where a guest writes to passthrough MSRs > * may incorrectly succeed if the CPU supports LA57, but the vCPU does not > * (because hardware has no awareness of guest CPUID). Do not try to plug this > * hole, i.e. emulate the behavior for intercepted accesses, as injecting #GP > * depending on whether or not KVM happens to emulate a WRMSR would result in > * non-deterministic behavior, and could even allow L2 to crash L1, e.g. if L1 > * passes through an MSR to L2, and then tries to save+restore L2's value. > */ > > > + > > +static u8 max_host_supported_virt_addr_bits(void) > > Any objection to dropping the "supported", i.e. going with max_host_virt_addr_bits()? > Mostly to shorten the name, but also because "supported" suggests there's software > involvement, e.g. the max supported by the kernel/KVM, which isn't the case. > > If you're ok with the above, I'll fixup when applying. I take that back, I think we're going to need a v4 (see patch 3).
У пт, 2024-08-16 у 14:49 -0700, Sean Christopherson пише: > > > > On Thu, Aug 15, 2024, Maxim Levitsky wrote: > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > > > > > index ce7c00894f32..2e83f7d74591 100644 > > > > > > > > --- a/arch/x86/kvm/x86.c > > > > > > > > +++ b/arch/x86/kvm/x86.c > > > > > > > > @@ -302,6 +302,31 @@ const struct kvm_stats_header kvm_vcpu_stats_header = { > > > > > > > > sizeof(kvm_vcpu_stats_desc), > > > > > > > > }; > > > > > > > > > > > > > > > > + > > > > > > > > +/* > > > > > > > > + * Most x86 arch MSR values which contain linear addresses like > > > > > > > > Is it most, or all? I'm guessing all? I can't be sure that all of them are like that - there could be some outliers that behave differently. One of the things my work at Intel taught me is that there is nothing consistent in x86 spec, anything is possible and nothing can be assumed. I dealt only with those msrs, that KVM checks for canonicality, therefore I use the word 'most'. There could be other msrs that are not known to me and/or to KVM. I can write 'some' if you prefer. > > > > > > > > > > > > + * segment bases, addresses that are used in instructions (e.g SYSENTER), > > > > > > > > + * have static canonicality checks, > > > > > > > > Weird and early line breaks. > > > > > > > > How about this? > > > > > > > > /* > > > > * The canonicality checks for MSRs that hold linear addresses, e.g. segment > > > > * bases, SYSENTER targets, etc., are static, in the sense that they are based > > > > * on CPU _support_ for 5-level paging, not the state of CR4.LA57. > > > > > > > > > > > > + * size of whose depends only on CPU's support for 5-level > > > > > > > > + * paging, rather than state of CR4.LA57. > > > > > > > > + * > > > > > > > > + * In addition to that, some of these MSRS are directly passed > > > > > > > > + * to the guest (e.g MSR_KERNEL_GS_BASE) thus even if the guest > > > > > > > > + * doen't have LA57 enabled in its CPUID, for consistency with > > > > > > > > + * CPUs' ucode, it is better to pivot the check around host > > > > > > > > + * support for 5 level paging. > > > > > > > > I think we should elaborate on why it's better. It only takes another line or > > > > two, and that way we don't forget the edge cases that make properly emulating > > > > guest CPUID a bad idea. OK, will do. > > > > > > > > * This creates a virtualization hole where a guest writes to passthrough MSRs > > > > * may incorrectly succeed if the CPU supports LA57, but the vCPU does not > > > > * (because hardware has no awareness of guest CPUID). Do not try to plug this > > > > * hole, i.e. emulate the behavior for intercepted accesses, as injecting #GP > > > > * depending on whether or not KVM happens to emulate a WRMSR would result in > > > > * non-deterministic behavior, and could even allow L2 to crash L1, e.g. if L1 > > > > * passes through an MSR to L2, and then tries to save+restore L2's value. > > > > */ > > > > > > > > > > > > + > > > > > > > > +static u8 max_host_supported_virt_addr_bits(void) > > > > > > > > Any objection to dropping the "supported", i.e. going with max_host_virt_addr_bits()? > > > > Mostly to shorten the name, but also because "supported" suggests there's software > > > > involvement, e.g. the max supported by the kernel/KVM, which isn't the case. Doesn't matter to me. > > > > > > > > If you're ok with the above, I'll fixup when applying. > > > > Best regards, Maxim Levitsky
У вт, 2024-08-20 у 15:13 +0300, mlevitsk@redhat.com пише: > У пт, 2024-08-16 у 14:49 -0700, Sean Christopherson пише: > > > > > On Thu, Aug 15, 2024, Maxim Levitsky wrote: > > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > > > > > > index ce7c00894f32..2e83f7d74591 100644 > > > > > > > > > --- a/arch/x86/kvm/x86.c > > > > > > > > > +++ b/arch/x86/kvm/x86.c > > > > > > > > > @@ -302,6 +302,31 @@ const struct kvm_stats_header kvm_vcpu_stats_header = { > > > > > > > > > sizeof(kvm_vcpu_stats_desc), > > > > > > > > > }; > > > > > > > > > > > > > > > > > > + > > > > > > > > > +/* > > > > > > > > > + * Most x86 arch MSR values which contain linear addresses like > > > > > > > > > > Is it most, or all? I'm guessing all? > > I can't be sure that all of them are like that - there could be some outliers that behave differently. > > One of the things my work at Intel taught me is that there is nothing consistent > in x86 spec, anything is possible and nothing can be assumed. > > I dealt only with those msrs, that KVM checks for canonicality, therefore I use the word > 'most'. There could be other msrs that are not known to me and/or to KVM. > > I can write 'some' if you prefer. Hi, So I did some more reverse engineering and indeed, 'some' is the right word: I audited all places in KVM which check an linear address for being canonical and this is what I found: - MSR_IA32_BNDCFGS - since it is not supported on CPUs with 5 level paging, its not possible to know what the hardware does. - MSR_IA32_DS_AREA: - Ignores CR4.LA57 as expected. Tested by booting into kernel with 5 level paging disabled and then using userspace 'wrmsr' program to set this msr. I attached the bash script that I used - MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B: - Exactly the same story, but for some reason the host doesn't suport (not even read) from MSR_IA32_RTIT_ADDR2_*, MSR_IA32_RTIT_ADDR3_*. Probably the system is not new enough for these. - invpcid instruction. It is exposed to the guest without interception (unless !npt or !ept), and yes, it works just fine on 57-canonical address without CR4.LA57 set.... - invvpid - this one belongs to VMX set, so technically its for nesting although it is run by L1, it is always emulated by KVM, but still executed on the host just with different vpid, so I booted the host without 5 level paging, and patched KVM to avoid canonical check. Also 57-canonical adddress worked just fine, and fully non canonical address failed. and gave a warning in 'invvpid_error' Should I fix all of these too? About fixing the emulator this is what see: emul_is_noncanonical_address __load_segment_descriptor load_segment_descriptor em_lldt em_ltr em_lgdt_lidt While em_lgdt_lidt should be easy to fix because it calls emul_is_noncanonical_address directly, the em_lldt, em_ltr will be harder because these use load_segment_descriptor which calls __load_segment_descriptor which in turn is also used for emulating of far jumps/calls/rets, for which I do believe that canonical check does respect CR4.LA57, but can't be sure either. It is possible that far jumps/calls/rets also ignore CR4.LA57, and instead set RIP to non canonical instruction, and then on first fetch, #GP happens. I'll setup another unit test for this. RIP of the #GP will determine if the instruction failed or the next fetch. Best regards, Maxim Levitsky > > > > > > > > > > > > > > > + * segment bases, addresses that are used in instructions (e.g SYSENTER), > > > > > > > > > + * have static canonicality checks, > > > > > > > > > > Weird and early line breaks. > > > > > > > > > > How about this? > > > > > > > > > > /* > > > > > * The canonicality checks for MSRs that hold linear addresses, e.g. segment > > > > > * bases, SYSENTER targets, etc., are static, in the sense that they are based > > > > > * on CPU _support_ for 5-level paging, not the state of CR4.LA57. > > > > > > > > > > > > > > + * size of whose depends only on CPU's support for 5-level > > > > > > > > > + * paging, rather than state of CR4.LA57. > > > > > > > > > + * > > > > > > > > > + * In addition to that, some of these MSRS are directly passed > > > > > > > > > + * to the guest (e.g MSR_KERNEL_GS_BASE) thus even if the guest > > > > > > > > > + * doen't have LA57 enabled in its CPUID, for consistency with > > > > > > > > > + * CPUs' ucode, it is better to pivot the check around host > > > > > > > > > + * support for 5 level paging. > > > > > > > > > > I think we should elaborate on why it's better. It only takes another line or > > > > > two, and that way we don't forget the edge cases that make properly emulating > > > > > guest CPUID a bad idea. > > OK, will do. > > > > > > > > > > > * This creates a virtualization hole where a guest writes to passthrough MSRs > > > > > * may incorrectly succeed if the CPU supports LA57, but the vCPU does not > > > > > * (because hardware has no awareness of guest CPUID). Do not try to plug this > > > > > * hole, i.e. emulate the behavior for intercepted accesses, as injecting #GP > > > > > * depending on whether or not KVM happens to emulate a WRMSR would result in > > > > > * non-deterministic behavior, and could even allow L2 to crash L1, e.g. if L1 > > > > > * passes through an MSR to L2, and then tries to save+restore L2's value. > > > > > */ > > > > > > > > > > > > > > + > > > > > > > > > +static u8 max_host_supported_virt_addr_bits(void) > > > > > > > > > > Any objection to dropping the "supported", i.e. going with max_host_virt_addr_bits()? > > > > > Mostly to shorten the name, but also because "supported" suggests there's software > > > > > involvement, e.g. the max supported by the kernel/KVM, which isn't the case. > > Doesn't matter to me. > > > > > > > > > > > If you're ok with the above, I'll fixup when applying. > > > > > > > Best regards, > Maxim Levitsky
On Wed, Aug 21, 2024, mlevitsk@redhat.com wrote: > У вт, 2024-08-20 у 15:13 +0300, mlevitsk@redhat.com пише: > > У пт, 2024-08-16 у 14:49 -0700, Sean Christopherson пише: > > > > > > On Thu, Aug 15, 2024, Maxim Levitsky wrote: > > > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > > > > > > > index ce7c00894f32..2e83f7d74591 100644 > > > > > > > > > > --- a/arch/x86/kvm/x86.c > > > > > > > > > > +++ b/arch/x86/kvm/x86.c > > > > > > > > > > @@ -302,6 +302,31 @@ const struct kvm_stats_header kvm_vcpu_stats_header = { > > > > > > > > > > sizeof(kvm_vcpu_stats_desc), > > > > > > > > > > }; > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > +/* > > > > > > > > > > + * Most x86 arch MSR values which contain linear addresses like > > > > > > > > > > > > Is it most, or all? I'm guessing all? > > > > I can't be sure that all of them are like that - there could be some > > outliers that behave differently. > > > > One of the things my work at Intel taught me is that there is nothing > > consistent in x86 spec, anything is possible and nothing can be assumed. > > > > I dealt only with those msrs, that KVM checks for canonicality, therefore I > > use the word 'most'. There could be other msrs that are not known to me > > and/or to KVM. > > > > I can write 'some' if you prefer. > > Hi, > > > So I did some more reverse engineering and indeed, 'some' is the right word: Is it? IIUC, we have yet to find an MSR that honors, CR4.LA57, i.e. it really is "all", so far as we know. > I audited all places in KVM which check an linear address for being canonical > and this is what I found: > > - MSR_IA32_BNDCFGS - since it is not supported on CPUs with 5 level paging, > its not possible to know what the hardware does. Heh, yeah, but I would be very surprised if MSR_IA32_BNDCFGS didn't follow all other system-ish MSRs. > - MSR_IA32_DS_AREA: - Ignores CR4.LA57 as expected. Tested by booting into kernel > with 5 level paging disabled and then using userspace 'wrmsr' program to > set this msr. I attached the bash script that I used > > - MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B: - Exactly the same story, > but for some reason the host doesn't suport (not even read) from > MSR_IA32_RTIT_ADDR2_*, MSR_IA32_RTIT_ADDR3_*. Probably the system is not new > enough for these. > > - invpcid instruction. It is exposed to the guest without interception > (unless !npt or !ept), and yes, it works just fine on 57-canonical address > without CR4.LA57 set.... Did you verify the behavior for the desciptor, the target, or both? I assume the memory operand, i.e. the address of the _descriptor_, honors CR4.LA57, but the target within the descriptor does not. If that assumption is correct, then this code in vm_mmu_invalidate_addr() is broken, as KVM actually needs to do a TLB flush if the address is canonical for the vCPU model, even if it's non-canonical for the current vCPU state. /* It's actually a GPA for vcpu->arch.guest_mmu. */ if (mmu != &vcpu->arch.guest_mmu) { /* INVLPG on a non-canonical address is a NOP according to the SDM. */ if (is_noncanonical_address(addr, vcpu)) return; kvm_x86_call(flush_tlb_gva)(vcpu, addr); } Assuming INVPCID is indicative of how INVLPG and INVVPID behave (they are nops if the _target_ is non-canonical), then the code is broken for INVLPG and INVVPID. And I think it's probably a safe assumption that a TLB flush is needed. E.g. the primary (possible only?) use case for INVVPID with a linear address is to flush TLB entries for a specific GVA=>HPA mapping, and honoring CR4.LA57 would prevent shadowing 5-level paging with a hypervisor that is using 4-level paging for itself. > - invvpid - this one belongs to VMX set, so technically its for nesting > although it is run by L1, it is always emulated by KVM, but still executed on > the host just with different vpid, so I booted the host without 5 level > paging, and patched KVM to avoid canonical check. > > Also 57-canonical adddress worked just fine, and fully non canonical > address failed. and gave a warning in 'invvpid_error' > > Should I fix all of these too? Yeah, though I believe we're at the point where we need to figure out a better naming scheme, because usage of what is currently is_noncanonical_address() will be, by far, in the minority. Hmm, actually, what if we extend the X86EMUL_F_* flags that were added for LAM (and eventually for LASS) to handle the non-canonical checks? That's essentially what LAM does already, there are just a few more flavors we now need to handle. E.g. I think we just need flags for MSRs and segment/DT bases. The only (or at least, most) confusing thing is that LAM/LASS do NOT apply to INVPLG accesses, but are exempt from LA57. But that's an arch oddity, not a problem KVM can solve. diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h index 55a18e2f2dcd..6da03a37bdd5 100644 --- a/arch/x86/kvm/kvm_emulate.h +++ b/arch/x86/kvm/kvm_emulate.h @@ -94,6 +94,8 @@ struct x86_instruction_info { #define X86EMUL_F_FETCH BIT(1) #define X86EMUL_F_IMPLICIT BIT(2) #define X86EMUL_F_INVLPG BIT(3) +#define X86EMUL_F_MSR BIT(4) +#define X86EMUL_F_BASE BIT(5) struct x86_emulate_ops { void (*vm_bugged)(struct x86_emulate_ctxt *ctxt); --- And then with that, we can do the below, and have emul_is_noncanonical_address() redirect to is_noncanonical_address() instead of being an open coded equivalent. --- static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu) { return kvm_is_cr4_bit_set(vcpu, X86_CR4_LA57) ? 57 : 48; } static inline u8 max_host_virt_addr_bits(void) { return kvm_cpu_cap_has(X86_FEATURE_LA57) ? 57 : 48; } static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu, unsigned int flags) { if (flags & (X86EMUL_F_INVLPG | X86EMUL_F_MSR | X86EMUL_F_DT_LOAD)) return !__is_canonical_address(la, max_host_virt_addr_bits()); else return !__is_canonical_address(la, vcpu_virt_addr_bits(vcpu)); } --- That will make it _much_ harder to incorrectly use is_noncanonical_address(), as all callers will be forced to specify the emulation type, i.e. there is no automatic, implied default type. Line lengths could get annoying, but with per-type flags, we could do selectively add a few wrappers, e.g. --- static inline bool is_noncanonical_msr_address(u64 la, struct kvm_vcpu *vcpu) { return is_noncanonical_address(la, vcpu, X86EMUL_F_MSR); } static inline bool is_noncanonical_base_address(u64 la, struct kvm_vcpu *vcpu) { return is_noncanonical_address(la, vcpu, X86EMUL_F_BASE); } static inline bool is_noncanonical_invlpg_address(u64 la, struct kvm_vcpu *vcpu) { return is_noncanonical_address(la, vcpu, X86EMUL_F_INVLPG); } --- We wouldn't want wrapper for everything, e.g. to minimize the risk of creating a de factor implicit default, but I think those three, and maybe a code/fetch variant, will cover all but a few users. > About fixing the emulator this is what see: > > emul_is_noncanonical_address > __load_segment_descriptor > load_segment_descriptor > em_lldt > em_ltr > > em_lgdt_lidt > > > > While em_lgdt_lidt should be easy to fix because it calls > emul_is_noncanonical_address directly, Those don't need to be fixed, they are validating the memory operand, not the base of the descriptor, i.e. aren't exempt from CR4.LA57. > the em_lldt, em_ltr will be harder > because these use load_segment_descriptor which calls > __load_segment_descriptor which in turn is also used for emulating of far > jumps/calls/rets, for which I do believe that canonical check does respect > CR4.LA57, but can't be sure either. I'm fairly certain this is a non-issue. CS, DS, ES, and SS have a fixed base of '0' in 64-bit mode, i.e. are completely exempt from canonical checks because the base address is always ignored. And while FS and GS do have base addresses, the segment descriptors themselves can only load 32-bit bases, i.e. _can't_ generate non-canonical addresses. There _is_ an indirect canonical check on the _descriptor_ (the actual descriptor pointed at by the selector, not the memory operand). The SDM is calls this case out in the LFS/LDS docs: If the FS, or GS register is being loaded with a non-NULL segment selector and any of the following is true: the segment selector index is not within descriptor table limits, the memory address of the descriptor is non-canonical ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ and similarly, when using a CALL GATE, the far transfer docs say: If the segment descriptor from a 64-bit call gate is in non-canonical space. And given that those implicit accesses are not subjected to LAM/LASS, I strongly suspect they honor CR4.LA57. So the explicit emul_is_noncanonical_address() check in __load_segment_descriptor() needs to be tagged X86EMUL_F_BASE, but otherwise it all should Just Work (knock wood). > It is possible that far jumps/calls/rets also ignore CR4.LA57, and instead > set RIP to non canonical instruction, and then on first fetch, #GP happens. I doubt this is the case for the final RIP check, especially since ucode does check vmcs.HOST_RIP against vmcs.HOST_CR4.LA57, but it's worth testing to confirm. > I'll setup another unit test for this. RIP of the #GP will determine if the > instruction failed or the next fetch.
У ср, 2024-08-21 у 09:04 -0700, Sean Christopherson пише: > > On Wed, Aug 21, 2024, mlevitsk@redhat.com wrote: > > > > У вт, 2024-08-20 у 15:13 +0300, mlevitsk@redhat.com пише: > > > > > > У пт, 2024-08-16 у 14:49 -0700, Sean Christopherson пише: > > > > > > > > > > > > > > On Thu, Aug 15, 2024, Maxim Levitsky wrote: > > > > > > > > > > > > > > > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > > > > > > > > > > > > > > > > > > > index ce7c00894f32..2e83f7d74591 100644 > > > > > > > > > > > > > > > > > > > > > > --- a/arch/x86/kvm/x86.c > > > > > > > > > > > > > > > > > > > > > > +++ b/arch/x86/kvm/x86.c > > > > > > > > > > > > > > > > > > > > > > @@ -302,6 +302,31 @@ const struct kvm_stats_header kvm_vcpu_stats_header = { > > > > > > > > > > > > > > > > > > > > > > sizeof(kvm_vcpu_stats_desc), > > > > > > > > > > > > > > > > > > > > > > }; > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > + > > > > > > > > > > > > > > > > > > > > > > +/* > > > > > > > > > > > > > > > > > > > > > > + * Most x86 arch MSR values which contain linear addresses like > > > > > > > > > > > > > > > > > > > > > > > > > > > > Is it most, or all? I'm guessing all? > > > > > > > > > > > > I can't be sure that all of them are like that - there could be some > > > > > > outliers that behave differently. > > > > > > > > > > > > One of the things my work at Intel taught me is that there is nothing > > > > > > consistent in x86 spec, anything is possible and nothing can be assumed. > > > > > > > > > > > > I dealt only with those msrs, that KVM checks for canonicality, therefore I > > > > > > use the word 'most'. There could be other msrs that are not known to me > > > > > > and/or to KVM. > > > > > > > > > > > > I can write 'some' if you prefer. > > > > > > > > Hi, > > > > > > > > > > > > So I did some more reverse engineering and indeed, 'some' is the right word: > > > > Is it? IIUC, we have yet to find an MSR that honors, CR4.LA57, i.e. it really > > is "all", so far as we know. This is more or less what I meant to say: I meant to say that I verified only 'some' of the msrs/instructions, and for others I can't know, although it's likely that indeed our theory is correct. > > > > > > I audited all places in KVM which check an linear address for being canonical > > > > and this is what I found: > > > > > > > > - MSR_IA32_BNDCFGS - since it is not supported on CPUs with 5 level paging, > > > > its not possible to know what the hardware does. > > > > Heh, yeah, but I would be very surprised if MSR_IA32_BNDCFGS didn't follow all > > other system-ish MSRs. > > > > > > - MSR_IA32_DS_AREA: - Ignores CR4.LA57 as expected. Tested by booting into kernel > > > > with 5 level paging disabled and then using userspace 'wrmsr' program to > > > > set this msr. I attached the bash script that I used > > > > > > > > - MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B: - Exactly the same story, > > > > but for some reason the host doesn't suport (not even read) from > > > > MSR_IA32_RTIT_ADDR2_*, MSR_IA32_RTIT_ADDR3_*. Probably the system is not new > > > > enough for these. > > > > > > > > - invpcid instruction. It is exposed to the guest without interception > > > > (unless !npt or !ept), and yes, it works just fine on 57-canonical address > > > > without CR4.LA57 set.... > > > > Did you verify the behavior for the desciptor, the target, or both? I assume > > the memory operand, i.e. the address of the _descriptor_, honors CR4.LA57, but > > the target within the descriptor does not. I verified only the target address. I assume that memory fetches do have to honor the CR.LA57. > > > > If that assumption is correct, then this code in vm_mmu_invalidate_addr() is broken, > > as KVM actually needs to do a TLB flush if the address is canonical for the vCPU > > model, even if it's non-canonical for the current vCPU state. > > > > /* It's actually a GPA for vcpu->arch.guest_mmu. */ > > if (mmu != &vcpu->arch.guest_mmu) { > > /* INVLPG on a non-canonical address is a NOP according to the SDM. */ > > if (is_noncanonical_address(addr, vcpu)) > > return; > > > > kvm_x86_call(flush_tlb_gva)(vcpu, addr); > > } > > > > Assuming INVPCID is indicative of how INVLPG and INVVPID behave (they are nops if > > the _target_ is non-canonical), then the code is broken for INVLPG and INVVPID. > > > > And I think it's probably a safe assumption that a TLB flush is needed. E.g. the > > primary (possible only?) use case for INVVPID with a linear address is to flush > > TLB entries for a specific GVA=>HPA mapping, and honoring CR4.LA57 would prevent > > shadowing 5-level paging with a hypervisor that is using 4-level paging for itself. I also think so. > > > > > > - invvpid - this one belongs to VMX set, so technically its for nesting > > > > although it is run by L1, it is always emulated by KVM, but still executed on > > > > the host just with different vpid, so I booted the host without 5 level > > > > paging, and patched KVM to avoid canonical check. > > > > > > > > Also 57-canonical adddress worked just fine, and fully non canonical > > > > address failed. and gave a warning in 'invvpid_error' > > > > > > > > Should I fix all of these too? > > > > Yeah, though I believe we're at the point where we need to figure out a better > > naming scheme, because usage of what is currently is_noncanonical_address() will > > be, by far, in the minority. Yes, I will take this into account. > > > > Hmm, actually, what if we extend the X86EMUL_F_* flags that were added for LAM > > (and eventually for LASS) to handle the non-canonical checks? That's essentially > > what LAM does already, there are just a few more flavors we now need to handle. > > > > E.g. I think we just need flags for MSRs and segment/DT bases. The only (or at > > least, most) confusing thing is that LAM/LASS do NOT apply to INVPLG accesses, > > but are exempt from LA57. But that's an arch oddity, not a problem KVM can solve. > > > > diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h > > index 55a18e2f2dcd..6da03a37bdd5 100644 > > --- a/arch/x86/kvm/kvm_emulate.h > > +++ b/arch/x86/kvm/kvm_emulate.h > > @@ -94,6 +94,8 @@ struct x86_instruction_info { > > #define X86EMUL_F_FETCH BIT(1) > > #define X86EMUL_F_IMPLICIT BIT(2) > > #define X86EMUL_F_INVLPG BIT(3) > > +#define X86EMUL_F_MSR BIT(4) > > +#define X86EMUL_F_BASE BIT(5) > > > > struct x86_emulate_ops { > > void (*vm_bugged)(struct x86_emulate_ctxt *ctxt); > > --- > > > > And then with that, we can do the below, and have emul_is_noncanonical_address() > > redirect to is_noncanonical_address() instead of being an open coded equivalent. > > > > --- > > static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu) > > { > > return kvm_is_cr4_bit_set(vcpu, X86_CR4_LA57) ? 57 : 48; > > } > > > > static inline u8 max_host_virt_addr_bits(void) > > { > > return kvm_cpu_cap_has(X86_FEATURE_LA57) ? 57 : 48; > > } This is a good name for this function, thanks. > > > > static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu, > > unsigned int flags) > > { > > if (flags & (X86EMUL_F_INVLPG | X86EMUL_F_MSR | X86EMUL_F_DT_LOAD)) > > return !__is_canonical_address(la, max_host_virt_addr_bits()); > > else > > return !__is_canonical_address(la, vcpu_virt_addr_bits(vcpu)); > > } This can work in principle, although are you OK with using these emulator flags outside of the emulator code? I am asking because the is_noncanonical_address is used in various places across KVM. > > --- > > > > That will make it _much_ harder to incorrectly use is_noncanonical_address(), > > as all callers will be forced to specify the emulation type, i.e. there is no > > automatic, implied default type. > > > > Line lengths could get annoying, but with per-type flags, we could do selectively > > add a few wrappers, e.g. > > > > --- > > static inline bool is_noncanonical_msr_address(u64 la, struct kvm_vcpu *vcpu) > > { > > return is_noncanonical_address(la, vcpu, X86EMUL_F_MSR); > > } > > > > static inline bool is_noncanonical_base_address(u64 la, struct kvm_vcpu *vcpu) > > { > > return is_noncanonical_address(la, vcpu, X86EMUL_F_BASE); > > } > > > > static inline bool is_noncanonical_invlpg_address(u64 la, struct kvm_vcpu *vcpu) > > { > > return is_noncanonical_address(la, vcpu, X86EMUL_F_INVLPG); > > } > > --- > > > > We wouldn't want wrapper for everything, e.g. to minimize the risk of creating a > > de factor implicit default, but I think those three, and maybe a code/fetch > > variant, will cover all but a few users. > > > > > > About fixing the emulator this is what see: > > > > > > > > emul_is_noncanonical_address > > > > __load_segment_descriptor > > > > load_segment_descriptor > > > > em_lldt > > > > em_ltr > > > > > > > > em_lgdt_lidt > > > > > > > > > > > > > > > > While em_lgdt_lidt should be easy to fix because it calls > > > > emul_is_noncanonical_address directly, > > > > Those don't need to be fixed, they are validating the memory operand, not the > > base of the descriptor, i.e. aren't exempt from CR4.LA57. Are you sure? em_lgdt_lidt reads its memory operands (and checks that it is canonical through linearize) with read_descriptor and that is fine because this is memory fetch, and then it checks that the base address within the operand is canonical. This check needs to be updated, as it is possible to load non canonical GDT and IDT base via lgdt/lidt (I tested this). For em_lldt, em_ltr, the check on the system segment descriptor base is in __load_segment_descriptor: ... ret = linear_read_system(ctxt, desc_addr+8, &base3, sizeof(base3)); if (ret != X86EMUL_CONTINUE) return ret; if (emul_is_noncanonical_address(get_desc_base(&seg_desc) | ((u64)base3 << 32), ctxt)) return emulate_gp(ctxt, err_code); ... 64 bases are possible only for system segments, which are TSS, LDT, and call gates/IDT descriptors. We don't emulate IDT fetches in protected mode, and as I found out the hard way after I wrote a unit test to do a call through a call gate, the emulator doesn't support call gates either) Thus I can safely patch __load_segment_descriptor. > > > > > > the em_lldt, em_ltr will be harder > > > > because these use load_segment_descriptor which calls > > > > __load_segment_descriptor which in turn is also used for emulating of far > > > > jumps/calls/rets, for which I do believe that canonical check does respect > > > > CR4.LA57, but can't be sure either. > > > > I'm fairly certain this is a non-issue. CS, DS, ES, and SS have a fixed base of > > '0' in 64-bit mode, i.e. are completely exempt from canonical checks because the > > base address is always ignored. And while FS and GS do have base addresses, the > > segment descriptors themselves can only load 32-bit bases, i.e. _can't_ generate > > non-canonical addresses. Yes - all data segments in 64 bit mode, still have 32 bit base in the GDT/LDT, so this is indeed not an issue. > > > > There _is_ an indirect canonical check on the _descriptor_ (the actual descriptor > > pointed at by the selector, not the memory operand). The SDM is calls this case > > out in the LFS/LDS docs: > > > > If the FS, or GS register is being loaded with a non-NULL segment selector and > > any of the following is true: the segment selector index is not within descriptor > > table limits, the memory address of the descriptor is non-canonical > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ I tested that both ltr and lldt do ignore CR4.LA57 by loading from a descriptor which had a non canonical value and CR4.LA57 clear. > > and similarly, when using a CALL GATE, the far transfer docs say: > > > > If the segment descriptor from a 64-bit call gate is in non-canonical space. For the academic reference, when loading offset from a call gate descriptor, the CPU does the canonical check, and it does honour the CR4.LA57. A variation of the below test can prove it: void set_callgate_entry(u16 sel, void *offset, int dpl) { idt_entry_t *e = (idt_entry_t *)&gdt[sel >> 3]; set_desc_entry(e, sizeof *e, offset, read_cs(), 12, dpl); } int main(int argc, char **argv) { u64 gate_offset = (u64)&&gate_entry_point; // toggle below to check various cases: //setup_5level_page_table(); //gate_offset = 0xff4547ceb1600000; set_callgate_entry(FIRST_SPARE_SEL, (void *)gate_offset, 0); struct { u64 offset; u16 sel; } call_target = { 0 /* ignored */, FIRST_SPARE_SEL}; // Perform the far call asm volatile goto ( // KVM_FEP ".byte 0x48; rex.w\n" // optional, just for fun "lcall *%0\n" "jmp %l1\n" : : "m" (call_target) : : return_address ); gate_entry_point: printf("Call gate worked\n"); asm volatile ( "retfq\n" ); return_address: printf("Exit from call gate worked\n"); return 0; } > > > > And given that those implicit accesses are not subjected to LAM/LASS, I strongly > > suspect they honor CR4.LA57. So the explicit emul_is_noncanonical_address() > > check in __load_segment_descriptor() needs to be tagged X86EMUL_F_BASE, but > > otherwise it all should Just Work (knock wood). > > > > > > It is possible that far jumps/calls/rets also ignore CR4.LA57, and instead > > > > set RIP to non canonical instruction, and then on first fetch, #GP happens. > > > > I doubt this is the case for the final RIP check, especially since ucode does > > check vmcs.HOST_RIP against vmcs.HOST_CR4.LA57, but it's worth testing to confirm. See above. So now I'll try to do a v4 of the patches with all of the feedback included. Thanks! > > > > > > I'll setup another unit test for this. RIP of the #GP will determine if the > > > > instruction failed or the next fetch. > > Best regards, Maxim Levitsky
On Fri, Aug 23, 2024, mlevitsk@redhat.com wrote: > У ср, 2024-08-21 у 09:04 -0700, Sean Christopherson пише: > > > static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu, > > > unsigned int flags) > > > { > > > if (flags & (X86EMUL_F_INVLPG | X86EMUL_F_MSR | X86EMUL_F_DT_LOAD)) > > > return !__is_canonical_address(la, max_host_virt_addr_bits()); > > > else > > > return !__is_canonical_address(la, vcpu_virt_addr_bits(vcpu)); > > > } > > This can work in principle, although are you OK with using these emulator flags > outside of the emulator code? Yep, they're already used in VMX's vmx_get_untagged_addr(). > I am asking because the is_noncanonical_address is used in various places across KVM. ... > > > We wouldn't want wrapper for everything, e.g. to minimize the risk of creating a > > > de factor implicit default, but I think those three, and maybe a code/fetch > > > variant, will cover all but a few users. > > > > > > > > About fixing the emulator this is what see: > > > > > > > > > > emul_is_noncanonical_address > > > > > __load_segment_descriptor > > > > > load_segment_descriptor > > > > > em_lldt > > > > > em_ltr > > > > > > > > > > em_lgdt_lidt > > > > > > > > > > > > > > > > > > > > While em_lgdt_lidt should be easy to fix because it calls > > > > > emul_is_noncanonical_address directly, > > > > > > Those don't need to be fixed, they are validating the memory operand, not the > > > base of the descriptor, i.e. aren't exempt from CR4.LA57. > > Are you sure? Nope. Re-reading what I wrote, I have no idea what code past me was looking at. I even contradicted myself later on: So the explicit emul_is_noncanonical_address() check in __load_segment_descriptor() needs to be tagged X86EMUL_F_BASE, but otherwise it all should Just Work (knock wood). so yeah, ignore this. > em_lgdt_lidt reads its memory operands (and checks that it is canonical through > linearize) with read_descriptor and that is fine because this is memory fetch, > and then it checks that the base address within the operand is canonical. > > This check needs to be updated, as it is possible to load non canonical GDT and IDT > base via lgdt/lidt (I tested this). > > For em_lldt, em_ltr, the check on the system segment descriptor base is > in __load_segment_descriptor: > > ... > ret = linear_read_system(ctxt, desc_addr+8, &base3, sizeof(base3)); > if (ret != X86EMUL_CONTINUE) > return ret; > if (emul_is_noncanonical_address(get_desc_base(&seg_desc) | > ((u64)base3 << 32), ctxt)) > return emulate_gp(ctxt, err_code); > > ... > > > 64 bases are possible only for system segments, which are > TSS, LDT, and call gates/IDT descriptors. > > > We don't emulate IDT fetches in protected mode, and as I found out the hard way after > I wrote a unit test to do a call through a call gate, the emulator doesn't > support call gates either) > > Thus I can safely patch __load_segment_descriptor. Agreed. And thinking more about how this is likely implemented in ucode, this is probably working as intended. The the SDM gives CPUs a _lot_ of leeway: In 64-bit mode, an address is considered to be in canonical form if address bits 63 through to the most-significant implemented bit by the microarchitecture are set to either all ones or all zeros. as does the APM: Long mode defines 64 bits of virtual address, but implementations of the AMD64 architecture may support fewer bits of virtual address. Although implementations might not use all 64 bits of the virtual address, they check bits 63 through the most-significant implemented bit to see if those bits are all zeros or all ones. An address that complies with this property is said to be in canonical address form. If a virtual-memory reference is not in canonical form, the implementation causes a general-protection exception or stack fault. I suspect that CR4.LA_57 is only consulted when the CPU is actually consuming the address, i.e. is (or is about to, e.g. for code fetches) generating a memory access. Heh, and for MPX, the SDM kinda sorta confirms that LA57 is ignored, though I doubt the author of this section intended their words to be taken this way :-) WRMSR to BNDCFGS will #GP if any of the reserved bits of BNDCFGS is not zero or if the base address of the bound directory is not canonical. XRSTOR of BNDCFGU ignores the reserved bits and does not fault if any is non-zero; similarly, it ignores the upper bits of the base address of the bound directory and sign-extends the highest implemented bit of the linear address to guarantee the canonicality ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ of this address. > > > There _is_ an indirect canonical check on the _descriptor_ (the actual descriptor > > > pointed at by the selector, not the memory operand). The SDM is calls this case > > > out in the LFS/LDS docs: > > > > > > If the FS, or GS register is being loaded with a non-NULL segment selector and > > > any of the following is true: the segment selector index is not within descriptor > > > table limits, the memory address of the descriptor is non-canonical > > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > I tested that both ltr and lldt do ignore CR4.LA57 by loading from a descriptor > which had a non canonical value and CR4.LA57 clear.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ce7c00894f32..2e83f7d74591 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -302,6 +302,31 @@ const struct kvm_stats_header kvm_vcpu_stats_header = { sizeof(kvm_vcpu_stats_desc), }; + +/* + * Most x86 arch MSR values which contain linear addresses like + * segment bases, addresses that are used in instructions (e.g SYSENTER), + * have static canonicality checks, + * size of whose depends only on CPU's support for 5-level + * paging, rather than state of CR4.LA57. + * + * In addition to that, some of these MSRS are directly passed + * to the guest (e.g MSR_KERNEL_GS_BASE) thus even if the guest + * doen't have LA57 enabled in its CPUID, for consistency with + * CPUs' ucode, it is better to pivot the check around host + * support for 5 level paging. + */ + +static u8 max_host_supported_virt_addr_bits(void) +{ + return kvm_cpu_cap_has(X86_FEATURE_LA57) ? 57 : 48; +} + +static bool is_host_noncanonical_msr_value(u64 la) +{ + return !__is_canonical_address(la, max_host_supported_virt_addr_bits()); +} + static struct kmem_cache *x86_emulator_cache; /* @@ -1829,7 +1854,7 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data, case MSR_KERNEL_GS_BASE: case MSR_CSTAR: case MSR_LSTAR: - if (is_noncanonical_address(data, vcpu)) + if (is_host_noncanonical_msr_value(data)) return 1; break; case MSR_IA32_SYSENTER_EIP: @@ -1846,7 +1871,7 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data, * value, and that something deterministic happens if the guest * invokes 64-bit SYSENTER. */ - data = __canonical_address(data, vcpu_virt_addr_bits(vcpu)); + data = __canonical_address(data, max_host_supported_virt_addr_bits()); break; case MSR_TSC_AUX: if (!kvm_is_supported_user_return_msr(MSR_TSC_AUX))
Several x86's arch msrs contain a linear base, and thus must contain a canonical address. This includes FS/GS base, addresses used for SYSENTER and SYSCALL instructions and probably more. As it turns out, when x86 architecture was updated to 5 level paging / 57 bit virtual addresses, these MSRs were allowed to contain a full 57 bit address regardless of the state of CR4.LA57. The main reason behind this decision is that 5 level paging, and even paging itself can be temporarily disabled (e.g by SMM entry) leaving non canonical values in these fields. Another reason is that OS might prepare these fields before it switches to 5 level paging. Experemental tests on a Sapphire Rapids CPU and on a Zen4 CPU confirm this behavior. In addition to that, the Intel ISA extension manual mentions that this might be the architectural behavior: Architecture Instruction Set Extensions and Future Features Programming Reference [1], Chapter 6.4: "CANONICALITY CHECKING FOR DATA ADDRESSES WRITTEN TO CONTROL REGISTERS AND MSRS" "In Processors that support LAM continue to require the addresses written tocontrol registers or MSRs to be 57-bit canonical if the processor supports 5-level paging or 48-bit canonical if it supports only 4-level paging" [1]: https://cdrdv2.intel.com/v1/dl/getContent/671368 Suggested-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> --- arch/x86/kvm/x86.c | 29 +++++++++++++++++++++++++++-- 1 file changed, 27 insertions(+), 2 deletions(-)