Message ID | 20240309010929.1403984-6-seanjc@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | KVM: VMX: Drop MTRR virtualization, honor guest PAT | expand |
On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote: > Unconditionally honor guest PAT on CPUs that support self-snoop, as > Intel has confirmed that CPUs that support self-snoop always snoop caches > and store buffers. I.e. CPUs with self-snoop maintain cache coherency > even in the presence of aliased memtypes, thus there is no need to trust > the guest behaves and only honor PAT as a last resort, as KVM does today. > > Honoring guest PAT is desirable for use cases where the guest has access > to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual > (mediated, for all intents and purposes) GPU is exposed to the guest, along > with buffers that are consumed directly by the physical GPU, i.e. which > can't be proxied by the host to ensure writes from the guest are performed > with the correct memory type for the GPU. > > Cc: Yiwei Zhang <zzyiwei@google.com> > Suggested-by: Yan Zhao <yan.y.zhao@intel.com> > Suggested-by: Kevin Tian <kevin.tian@intel.com> > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/mmu/mmu.c | 8 +++++--- > arch/x86/kvm/vmx/vmx.c | 10 ++++++---- > 2 files changed, 11 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 403cd8f914cd..7fa514830628 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, > bool kvm_mmu_may_ignore_guest_pat(void) > { > /* > - * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM > + * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does > + * not support self-snoop (or is affected by an erratum), and the VM > * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to > * honor the memtype from the guest's PAT so that guest accesses to > * memory that is DMA'd aren't cached against the guest's wishes. As a > * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA, > - * KVM _always_ ignores guest PAT (when EPT is enabled). > + * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE > + * bits in response to non-coherent device (un)registration. > */ > - return shadow_memtype_mask; > + return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask; > } > > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 17a8e4fdf9c4..5dc4c24ae203 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > > /* > * Force WB and ignore guest PAT if the VM does NOT have a non-coherent > - * device attached. Letting the guest control memory types on Intel > - * CPUs may result in unexpected behavior, and so KVM's ABI is to trust > - * the guest to behave only as a last resort. > + * device attached and the CPU doesn't support self-snoop. Letting the > + * guest control memory types on Intel CPUs without self-snoop may > + * result in unexpected behavior, and so KVM's (historical) ABI is to > + * trust the guest to behave only as a last resort. > */ > - if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > + if (!static_cpu_has(X86_FEATURE_SELFSNOOP) && > + !kvm_arch_has_noncoherent_dma(vcpu->kvm)) > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) && kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn about unsafe before honoring guest memory type. Though it's a KVM's historical ABI, it's not safe in the security perspective because page aliasing without proper cache flush handling on CPUs without self-snoop may open a door for guest to read uninitialized host data. e.g. when there's a noncoherent DMA device attached, and if there's a memory region that is not pinned in vfio/iommufd side, (e.g. memory region in vfio's skipped section), then though the guest memory from this memory region is not accessible to noncoherent DMAs, vCPUs can still access this part of guest memory. Then if vCPUs use WC as guest type, it may bypass host's initialization data in cache and read stale data in host, causing information leak. My preference is still to force WB (i.e. (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT) in case of !static_cpu_has(X86_FEATURE_SELFSNOOP) && kvm_arch_has_noncoherent_dma(vcpu->kvm). Firstly, it's because there're few CPUs with features VMX without self-snoop; Secondly, security takes priority over functionality :) > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT); > -- > 2.44.0.278.ge034bb2e1d-goog >
On Mon, Mar 11, 2024, Yan Zhao wrote: > On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote: > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > > index 17a8e4fdf9c4..5dc4c24ae203 100644 > > --- a/arch/x86/kvm/vmx/vmx.c > > +++ b/arch/x86/kvm/vmx/vmx.c > > @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > > > > /* > > * Force WB and ignore guest PAT if the VM does NOT have a non-coherent > > - * device attached. Letting the guest control memory types on Intel > > - * CPUs may result in unexpected behavior, and so KVM's ABI is to trust > > - * the guest to behave only as a last resort. > > + * device attached and the CPU doesn't support self-snoop. Letting the > > + * guest control memory types on Intel CPUs without self-snoop may > > + * result in unexpected behavior, and so KVM's (historical) ABI is to > > + * trust the guest to behave only as a last resort. > > */ > > - if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > > + if (!static_cpu_has(X86_FEATURE_SELFSNOOP) && > > + !kvm_arch_has_noncoherent_dma(vcpu->kvm)) > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) && > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn > about unsafe before honoring guest memory type. I don't think it gains us enough to offset the potential pain such a message would bring. Assuming the warning isn't outright ignored, the most likely scenario is that the warning will cause random end users to worry that the setup they've been running for years is broken, when in reality it's probably just fine for their use case. I would be quite surprised if there are people running untrusted workloads on 10+ year old silicon *and* have passthrough devices and non-coherent IOMMUs/DMA. And anyone exposing a device directly to an untrusted workload really should have done their homework. And it's not like we're going to change KVM's historical behavior at this point. > Though it's a KVM's historical ABI, it's not safe in the security perspective > because page aliasing without proper cache flush handling on CPUs without > self-snoop may open a door for guest to read uninitialized host data. > e.g. when there's a noncoherent DMA device attached, and if there's a memory > region that is not pinned in vfio/iommufd side, (e.g. memory region in vfio's > skipped section), then though the guest memory from this memory region is not > accessible to noncoherent DMAs, vCPUs can still access this part of guest memory. > Then if vCPUs use WC as guest type, it may bypass host's initialization data in > cache and read stale data in host, causing information leak. > > My preference is still to force WB > (i.e. (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT) in case of > !static_cpu_has(X86_FEATURE_SELFSNOOP) && kvm_arch_has_noncoherent_dma(vcpu->kvm). > Firstly, it's because there're few CPUs with features VMX without self-snoop; This is unfortunately not true. I don't know the details, but apparently all Intel CPUs before Ivy Bridge had a one or more related errata. /* * Processors which have self-snooping capability can handle conflicting * memory type across CPUs by snooping its own cache. However, there exists * CPU models in which having conflicting memory types still leads to * unpredictable behavior, machine check errors, or hangs. Clear this * feature to prevent its use on machines with known erratas. */ static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c) { switch (c->x86_model) { case INTEL_FAM6_CORE_YONAH: case INTEL_FAM6_CORE2_MEROM: case INTEL_FAM6_CORE2_MEROM_L: case INTEL_FAM6_CORE2_PENRYN: case INTEL_FAM6_CORE2_DUNNINGTON: case INTEL_FAM6_NEHALEM: case INTEL_FAM6_NEHALEM_G: case INTEL_FAM6_NEHALEM_EP: case INTEL_FAM6_NEHALEM_EX: case INTEL_FAM6_WESTMERE: case INTEL_FAM6_WESTMERE_EP: case INTEL_FAM6_SANDYBRIDGE: setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP); } } > Secondly, security takes priority over functionality :) Yeah, but not breaking userspace for setups that have existed for 10+ years takes priority over all of that :-)
> From: Sean Christopherson <seanjc@google.com> > Sent: Tuesday, March 12, 2024 8:26 AM > > On Mon, Mar 11, 2024, Yan Zhao wrote: > > On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote: > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > > > index 17a8e4fdf9c4..5dc4c24ae203 100644 > > > --- a/arch/x86/kvm/vmx/vmx.c > > > +++ b/arch/x86/kvm/vmx/vmx.c > > > @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu > *vcpu, gfn_t gfn, bool is_mmio) > > > > > > /* > > > * Force WB and ignore guest PAT if the VM does NOT have a non- > coherent > > > - * device attached. Letting the guest control memory types on Intel > > > - * CPUs may result in unexpected behavior, and so KVM's ABI is to > trust > > > - * the guest to behave only as a last resort. > > > + * device attached and the CPU doesn't support self-snoop. Letting > the > > > + * guest control memory types on Intel CPUs without self-snoop may > > > + * result in unexpected behavior, and so KVM's (historical) ABI is to > > > + * trust the guest to behave only as a last resort. > > > */ > > > - if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > > > + if (!static_cpu_has(X86_FEATURE_SELFSNOOP) && > > > + !kvm_arch_has_noncoherent_dma(vcpu->kvm)) > > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) > | VMX_EPT_IPAT_BIT; > > > > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) && > > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should > warn > > about unsafe before honoring guest memory type. > > I don't think it gains us enough to offset the potential pain such a message > would bring. Assuming the warning isn't outright ignored, the most likely > scenario > is that the warning will cause random end users to worry that the setup > they've > been running for years is broken, when in reality it's probably just fine for > their > use case. Isn't the 'worry' necessary to allow end users evaluate whether "it's probably just fine for their use case"? I saw the old comment already mentioned that doing so may lead to unexpected behaviors. But I'm not sure whether such code-level caveat has been visible enough to end users. > > I would be quite surprised if there are people running untrusted workloads > on > 10+ year old silicon *and* have passthrough devices and non-coherent > IOMMUs/DMA. this is probably true. > And anyone exposing a device directly to an untrusted workload really > should have > done their homework. or they run trusted workloads which might be tampered by virus to exceed the scope of their homework.
On Tue, Mar 12, 2024, Kevin Tian wrote: > > From: Sean Christopherson <seanjc@google.com> > > Sent: Tuesday, March 12, 2024 8:26 AM > > > > On Mon, Mar 11, 2024, Yan Zhao wrote: > > > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) && > > > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn > > > about unsafe before honoring guest memory type. > > > > I don't think it gains us enough to offset the potential pain such a > > message would bring. Assuming the warning isn't outright ignored, the most > > likely scenario is that the warning will cause random end users to worry > > that the setup they've been running for years is broken, when in reality > > it's probably just fine for their > > use case. > > Isn't the 'worry' necessary to allow end users evaluate whether "it's > probably just fine for their use case"? Realistically, outside of large scale deployments, no end user is going to be able to make that evaluation, because practically speaking it requires someone with quite low-level hardware knowledge to be able to make that judgment call. And counting by number of human end users (as opposed to number of VMs being run), I am willing to bet that the overwhelming majority of KVM users aren't kernel or systems engineers. Understandably, users tend to be alarmed by (or suspicious of) new warnings that show up. E.g. see the ancient KVM_SET_TSS_ADDR pr_warn[*]. And recently, we had an internal bug report filed against KVM because they observed a performance regression when booting a KVM guest, and saw a new message about some CPU vulnerability being mitigated on VM-Exit that showed up in their *guest* kernel. In short, my concern is that adding a new pr_warn() will generate noise for end users *and* for KVM developers/maintainers, because even if we phrase the message to talk specifically about "untrusted workloads", the majority of affected users will not have the necessary knowledge to make an informed decision. [*] https://lore.kernel.org/all/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956@tarent.de > I saw the old comment already mentioned that doing so may lead to unexpected > behaviors. But I'm not sure whether such code-level caveat has been visible > enough to end users. Another point to consider: KVM is _always_ potentially broken on such CPUs, as KVM forces WB for guest accesses. I.e. KVM will create memory aliasing if the host has guest memory mapped as non-WB in the PAT, without non-coherent DMA exposed to the guest. > > I would be quite surprised if there are people running untrusted workloads > > on 10+ year old silicon *and* have passthrough devices and non-coherent > > IOMMUs/DMA. > > this is probably true. > > > And anyone exposing a device directly to an untrusted workload really > > should have done their homework. > > or they run trusted workloads which might be tampered by virus to > exceed the scope of their homework.
On Tue, Mar 12, 2024 at 09:07:11AM -0700, Sean Christopherson wrote: > On Tue, Mar 12, 2024, Kevin Tian wrote: > > > From: Sean Christopherson <seanjc@google.com> > > > Sent: Tuesday, March 12, 2024 8:26 AM > > > > > > On Mon, Mar 11, 2024, Yan Zhao wrote: > > > > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) && > > > > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn > > > > about unsafe before honoring guest memory type. > > > > > > I don't think it gains us enough to offset the potential pain such a > > > message would bring. Assuming the warning isn't outright ignored, the most > > > likely scenario is that the warning will cause random end users to worry > > > that the setup they've been running for years is broken, when in reality > > > it's probably just fine for their > > > use case. > > > > Isn't the 'worry' necessary to allow end users evaluate whether "it's > > probably just fine for their use case"? > > Realistically, outside of large scale deployments, no end user is going to be able > to make that evaluation, because practically speaking it requires someone with > quite low-level hardware knowledge to be able to make that judgment call. And > counting by number of human end users (as opposed to number of VMs being run), I > am willing to bet that the overwhelming majority of KVM users aren't kernel or > systems engineers. > > Understandably, users tend to be alarmed by (or suspicious of) new warnings that > show up. E.g. see the ancient KVM_SET_TSS_ADDR pr_warn[*]. And recently, we had > an internal bug report filed against KVM because they observed a performance > regression when booting a KVM guest, and saw a new message about some CPU > vulnerability being mitigated on VM-Exit that showed up in their *guest* kernel. > > In short, my concern is that adding a new pr_warn() will generate noise for end > users *and* for KVM developers/maintainers, because even if we phrase the message > to talk specifically about "untrusted workloads", the majority of affected users > will not have the necessary knowledge to make an informed decision. > > [*] https://lore.kernel.org/all/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956@tarent.de > > > I saw the old comment already mentioned that doing so may lead to unexpected > > behaviors. But I'm not sure whether such code-level caveat has been visible > > enough to end users. > What about add a new module parameter to turn on honoring guest for non-coherent DMAs on CPUs without self-snoop? A previous example is VFIO's "allow_unsafe_interrupts" parameter. > Another point to consider: KVM is _always_ potentially broken on such CPUs, as > KVM forces WB for guest accesses. I.e. KVM will create memory aliasing if the > host has guest memory mapped as non-WB in the PAT, without non-coherent DMA > exposed to the guest. In this case, memory aliasing may only lead to guest not function well, since guest is not using WC/UC (which can bypass host initialization data in cache). But if guest has any chance to read information not intended to it, I believe we need to fix it as well. > > > I would be quite surprised if there are people running untrusted workloads > > > on 10+ year old silicon *and* have passthrough devices and non-coherent > > > IOMMUs/DMA. What if the guest is a totally malicious one? Previously we trust the guest in the case of noncoherent DMA is because we believe a malicious guest will only meet data corruption and shoot his own foot. But as Jason raised the security problem in another mail thread [1], this will expose security hole if CPUs have no self-snoop. So, we need to fix it, right? + Jason, in case I didn't understand this problem correctly. [1] https://lore.kernel.org/all/20240108153818.GK50406@nvidia.com/ > > this is probably true. > > > > > And anyone exposing a device directly to an untrusted workload really > > > should have done their homework. > > > > or they run trusted workloads which might be tampered by virus to > > exceed the scope of their homework.
> From: Zhao, Yan Y <yan.y.zhao@intel.com> > Sent: Wednesday, March 13, 2024 9:19 AM > > On Tue, Mar 12, 2024 at 09:07:11AM -0700, Sean Christopherson wrote: > > On Tue, Mar 12, 2024, Kevin Tian wrote: > > > I saw the old comment already mentioned that doing so may lead to > unexpected > > > behaviors. But I'm not sure whether such code-level caveat has been > visible > > > enough to end users. > > > What about add a new module parameter to turn on honoring guest for > non-coherent DMAs on CPUs without self-snoop? > A previous example is VFIO's "allow_unsafe_interrupts" parameter. Not sure whether such parameter has a real value. If it's default 'off' then you break those 10yr+ setups and it's unacceptable. If it's default 'on' then same effect as this patch does then I'm not sure who'd want to turn it off afterwards. Somebody aware of such limitation can simply avoid assigning device w/ non-coherent DMA in VM config file instead of further going to toggle the module parameter to prevent something which he already knows not to do. > > > Another point to consider: KVM is _always_ potentially broken on such > CPUs, as > > KVM forces WB for guest accesses. I.e. KVM will create memory aliasing if > the > > host has guest memory mapped as non-WB in the PAT, without non- > coherent DMA > > exposed to the guest. > In this case, memory aliasing may only lead to guest not function well, since > guest is not using WC/UC (which can bypass host initialization data in cache). > But if guest has any chance to read information not intended to it, I believe > we need to fix it as well. Having cache/memory inconsistent could hurt both guest and host. So in concept forcing WB instead of following host attribute on such CPUs is kind of broken, though in reality we may not see an usage of exposing non-WB memory to guest on those old setups as discussed for virtio-gpu case. > > > > > > I would be quite surprised if there are people running untrusted > workloads > > > > on 10+ year old silicon *and* have passthrough devices and non- > coherent > > > > IOMMUs/DMA. > What if the guest is a totally malicious one? > Previously we trust the guest in the case of noncoherent DMA is because > we believe a malicious guest will only meet data corruption and shoot his > own > foot. > > But as Jason raised the security problem in another mail thread [1], > this will expose security hole if CPUs have no self-snoop. So, we need > to fix it, right? > + Jason, in case I didn't understand this problem correctly. > > [1] https://lore.kernel.org/all/20240108153818.GK50406@nvidia.com/ We'll certain fix the security hole on CPUs w/ self-snoop. In this case CPU accesses are guaranteed to be coherent and the vulnerability can only be exposed via non-coherent DMA which is supposed to be fixed by your coming series. But for old CPUs w/o self-snoop the hole can be exploited using either CPU or non-coherent DMA once the guest PAT is honored. As long as nobody is willing to actually fix the CPU path (is it possible?) I'm kind of convinced by Sean that sustaining the old behavior is probably the best option...
> We'll certain fix the security hole on CPUs w/ self-snoop. In this case > CPU accesses are guaranteed to be coherent and the vulnerability can > only be exposed via non-coherent DMA which is supposed to be fixed > by your coming series. > > But for old CPUs w/o self-snoop the hole can be exploited using either CPU > or non-coherent DMA once the guest PAT is honored. As long as nobody > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced We can cook a patch to check CPU self-snoop and force WB in EPT even for non-coherent DMA if no self-snoop. Then back porting such a patch together with the IOMMU side mitigation for non-coherent DMA. Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of no self-snoop. > by Sean that sustaining the old behavior is probably the best option... Yes, as long as we think exposing secuirty hole on those platforms is acceptable.
On Wed, Mar 13, 2024, Yan Zhao wrote: > > We'll certain fix the security hole on CPUs w/ self-snoop. In this case > > CPU accesses are guaranteed to be coherent and the vulnerability can > > only be exposed via non-coherent DMA which is supposed to be fixed > > by your coming series. > > > > But for old CPUs w/o self-snoop the hole can be exploited using either CPU > > or non-coherent DMA once the guest PAT is honored. As long as nobody > > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced > We can cook a patch to check CPU self-snoop and force WB in EPT even for > non-coherent DMA if no self-snoop. Then back porting such a patch together > with the IOMMU side mitigation for non-coherent DMA. Please don't. This is a "let sleeping dogs lie" situation. let sleeping dogs lie - avoid interfering in a situation that is currently causing no problems but might do so as a result of such interference. Yes, there is technically a flaw, but we have zero evidence that anyone cares or that it is actually problematic in practice. On the other hand, any functional change we make has a non-zero changes of breaking existing setups that have worked for many years. > Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of > no self-snoop. > > > by Sean that sustaining the old behavior is probably the best option... > Yes, as long as we think exposing secuirty hole on those platforms is acceptable. Yes, I think it's acceptable. Obviously not ideal, but given the alternatives, I think it is a reasonable risk. Being 100% secure is simply not possible. Security is often about balancing the risk/threat against the cost. In this case, the risk is low (old hardware, uncommon setup for untrusted guests, small window of opportunity, and limited data exposure), whereas the cost is high (decent chance of breaking existing VMs).
On Wed, Mar 13, 2024 at 08:09:28AM -0700, Sean Christopherson wrote: > On Wed, Mar 13, 2024, Yan Zhao wrote: > > > We'll certain fix the security hole on CPUs w/ self-snoop. In this case > > > CPU accesses are guaranteed to be coherent and the vulnerability can > > > only be exposed via non-coherent DMA which is supposed to be fixed > > > by your coming series. > > > > > > But for old CPUs w/o self-snoop the hole can be exploited using either CPU > > > or non-coherent DMA once the guest PAT is honored. As long as nobody > > > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced > > We can cook a patch to check CPU self-snoop and force WB in EPT even for > > non-coherent DMA if no self-snoop. Then back porting such a patch together > > with the IOMMU side mitigation for non-coherent DMA. > > Please don't. This is a "let sleeping dogs lie" situation. > > let sleeping dogs lie - avoid interfering in a situation that is currently > causing no problems but might do so as a result of such interference. > > Yes, there is technically a flaw, but we have zero evidence that anyone cares or > that it is actually problematic in practice. On the other hand, any functional > change we make has a non-zero changes of breaking existing setups that have worked > for many years. > > > Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of > > no self-snoop. > > > > > by Sean that sustaining the old behavior is probably the best option... > > Yes, as long as we think exposing secuirty hole on those platforms is acceptable. > > Yes, I think it's acceptable. Obviously not ideal, but given the alternatives, > I think it is a reasonable risk. > > Being 100% secure is simply not possible. Security is often about balancing the > risk/threat against the cost. In this case, the risk is low (old hardware, > uncommon setup for untrusted guests, small window of opportunity, and limited > data exposure), whereas the cost is high (decent chance of breaking existing VMs). Ok, thanks for explanation! I still have one last question: if in future there are CPUs with no selfsnoop (for some unknown reason, or just paranoid), do we allow this unsafe honoring of guest memory type for non-coherent DMAs?
On Thu, Mar 14, 2024, Yan Zhao wrote: > On Wed, Mar 13, 2024 at 08:09:28AM -0700, Sean Christopherson wrote: > > On Wed, Mar 13, 2024, Yan Zhao wrote: > > > > We'll certain fix the security hole on CPUs w/ self-snoop. In this case > > > > CPU accesses are guaranteed to be coherent and the vulnerability can > > > > only be exposed via non-coherent DMA which is supposed to be fixed > > > > by your coming series. > > > > > > > > But for old CPUs w/o self-snoop the hole can be exploited using either CPU > > > > or non-coherent DMA once the guest PAT is honored. As long as nobody > > > > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced > > > We can cook a patch to check CPU self-snoop and force WB in EPT even for > > > non-coherent DMA if no self-snoop. Then back porting such a patch together > > > with the IOMMU side mitigation for non-coherent DMA. > > > > Please don't. This is a "let sleeping dogs lie" situation. > > > > let sleeping dogs lie - avoid interfering in a situation that is currently > > causing no problems but might do so as a result of such interference. > > > > Yes, there is technically a flaw, but we have zero evidence that anyone cares or > > that it is actually problematic in practice. On the other hand, any functional > > change we make has a non-zero changes of breaking existing setups that have worked > > for many years. > > > > > Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of > > > no self-snoop. > > > > > > > by Sean that sustaining the old behavior is probably the best option... > > > Yes, as long as we think exposing secuirty hole on those platforms is acceptable. > > > > Yes, I think it's acceptable. Obviously not ideal, but given the alternatives, > > I think it is a reasonable risk. > > > > Being 100% secure is simply not possible. Security is often about balancing the > > risk/threat against the cost. In this case, the risk is low (old hardware, > > uncommon setup for untrusted guests, small window of opportunity, and limited > > data exposure), whereas the cost is high (decent chance of breaking existing VMs). > Ok, thanks for explanation! > I still have one last question: if in future there are CPUs with no selfsnoop > (for some unknown reason, or just paranoid), Don't jinx us :-) > do we allow this unsafe honoring of guest memory type for non-coherent DMAs? Hmm, no? AIUI, the intent is that all future CPUs will support self-snoop. Assuming that's the case, then I think it's fair to put the onus on Intel to not have escapes, i.e. to not end up shipping CPUs with erratas. Then, if an erratum does come along, we can make a decision, e.g. allow older CPUs by default, but require an admin opt-in for new CPUs. But let's just hope that's a problem we never have to deal with. :-D
On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote: >Unconditionally honor guest PAT on CPUs that support self-snoop, as >Intel has confirmed that CPUs that support self-snoop always snoop caches >and store buffers. I.e. CPUs with self-snoop maintain cache coherency >even in the presence of aliased memtypes, thus there is no need to trust >the guest behaves and only honor PAT as a last resort, as KVM does today. > >Honoring guest PAT is desirable for use cases where the guest has access >to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual >(mediated, for all intents and purposes) GPU is exposed to the guest, along >with buffers that are consumed directly by the physical GPU, i.e. which >can't be proxied by the host to ensure writes from the guest are performed >with the correct memory type for the GPU. > >Cc: Yiwei Zhang <zzyiwei@google.com> >Suggested-by: Yan Zhao <yan.y.zhao@intel.com> >Suggested-by: Kevin Tian <kevin.tian@intel.com> >Signed-off-by: Sean Christopherson <seanjc@google.com> >--- > arch/x86/kvm/mmu/mmu.c | 8 +++++--- > arch/x86/kvm/vmx/vmx.c | 10 ++++++---- > 2 files changed, 11 insertions(+), 7 deletions(-) > >diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c >index 403cd8f914cd..7fa514830628 100644 >--- a/arch/x86/kvm/mmu/mmu.c >+++ b/arch/x86/kvm/mmu/mmu.c >@@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, > bool kvm_mmu_may_ignore_guest_pat(void) > { > /* >- * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM >+ * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does >+ * not support self-snoop (or is affected by an erratum), and the VM > * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to > * honor the memtype from the guest's PAT so that guest accesses to > * memory that is DMA'd aren't cached against the guest's wishes. As a > * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA, >- * KVM _always_ ignores guest PAT (when EPT is enabled). >+ * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE >+ * bits in response to non-coherent device (un)registration. > */ >- return shadow_memtype_mask; >+ return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask; > } > > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) >diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c >index 17a8e4fdf9c4..5dc4c24ae203 100644 >--- a/arch/x86/kvm/vmx/vmx.c >+++ b/arch/x86/kvm/vmx/vmx.c >@@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > > /* > * Force WB and ignore guest PAT if the VM does NOT have a non-coherent >- * device attached. Letting the guest control memory types on Intel >- * CPUs may result in unexpected behavior, and so KVM's ABI is to trust >- * the guest to behave only as a last resort. >+ * device attached and the CPU doesn't support self-snoop. Letting the >+ * guest control memory types on Intel CPUs without self-snoop may >+ * result in unexpected behavior, and so KVM's (historical) ABI is to >+ * trust the guest to behave only as a last resort. > */ >- if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) >+ if (!static_cpu_has(X86_FEATURE_SELFSNOOP) && >+ !kvm_arch_has_noncoherent_dma(vcpu->kvm)) > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; W/ this change, guests w/o pass-thru devices can also access UC memory. Locking UC memory leads to bus lock. So, guests w/o pass-thru devices can potentially launch DOS attacks on other CPUs on host. isn't it a problem?
On Mon, Mar 25, 2024, Chao Gao wrote: > On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote: > >Unconditionally honor guest PAT on CPUs that support self-snoop, as > >Intel has confirmed that CPUs that support self-snoop always snoop caches > >and store buffers. I.e. CPUs with self-snoop maintain cache coherency > >even in the presence of aliased memtypes, thus there is no need to trust > >the guest behaves and only honor PAT as a last resort, as KVM does today. > > > >Honoring guest PAT is desirable for use cases where the guest has access > >to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual > >(mediated, for all intents and purposes) GPU is exposed to the guest, along > >with buffers that are consumed directly by the physical GPU, i.e. which > >can't be proxied by the host to ensure writes from the guest are performed > >with the correct memory type for the GPU. ... > > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > >diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > >index 17a8e4fdf9c4..5dc4c24ae203 100644 > >--- a/arch/x86/kvm/vmx/vmx.c > >+++ b/arch/x86/kvm/vmx/vmx.c > >@@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > > > > /* > > * Force WB and ignore guest PAT if the VM does NOT have a non-coherent > >- * device attached. Letting the guest control memory types on Intel > >- * CPUs may result in unexpected behavior, and so KVM's ABI is to trust > >- * the guest to behave only as a last resort. > >+ * device attached and the CPU doesn't support self-snoop. Letting the > >+ * guest control memory types on Intel CPUs without self-snoop may > >+ * result in unexpected behavior, and so KVM's (historical) ABI is to > >+ * trust the guest to behave only as a last resort. > > */ > >- if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > >+ if (!static_cpu_has(X86_FEATURE_SELFSNOOP) && > >+ !kvm_arch_has_noncoherent_dma(vcpu->kvm)) > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > W/ this change, guests w/o pass-thru devices can also access UC memory. Locking > UC memory leads to bus lock. So, guests w/o pass-thru devices can potentially > launch DOS attacks on other CPUs on host. isn't it a problem? Guests can already trigger bus locks with atomic accesses that split cache lines. And SPR adds bus lock detection. So practically speaking, I'm pretty sure ICX is the only CPU where anything close to a novel attack is possible. And FWIW, such an attack is already possible on AMD.
Sean Christopherson <seanjc@google.com> writes: > Unconditionally honor guest PAT on CPUs that support self-snoop, as > Intel has confirmed that CPUs that support self-snoop always snoop caches > and store buffers. I.e. CPUs with self-snoop maintain cache coherency > even in the presence of aliased memtypes, thus there is no need to trust > the guest behaves and only honor PAT as a last resort, as KVM does today. > > Honoring guest PAT is desirable for use cases where the guest has access > to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual > (mediated, for all intents and purposes) GPU is exposed to the guest, along > with buffers that are consumed directly by the physical GPU, i.e. which > can't be proxied by the host to ensure writes from the guest are performed > with the correct memory type for the GPU. > > Cc: Yiwei Zhang <zzyiwei@google.com> > Suggested-by: Yan Zhao <yan.y.zhao@intel.com> > Suggested-by: Kevin Tian <kevin.tian@intel.com> > Signed-off-by: Sean Christopherson <seanjc@google.com> > --- > arch/x86/kvm/mmu/mmu.c | 8 +++++--- > arch/x86/kvm/vmx/vmx.c | 10 ++++++---- > 2 files changed, 11 insertions(+), 7 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 403cd8f914cd..7fa514830628 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, > bool kvm_mmu_may_ignore_guest_pat(void) > { > /* > - * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM > + * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does > + * not support self-snoop (or is affected by an erratum), and the VM > * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to > * honor the memtype from the guest's PAT so that guest accesses to > * memory that is DMA'd aren't cached against the guest's wishes. As a > * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA, > - * KVM _always_ ignores guest PAT (when EPT is enabled). > + * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE > + * bits in response to non-coherent device (un)registration. > */ > - return shadow_memtype_mask; > + return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask; > } > > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index 17a8e4fdf9c4..5dc4c24ae203 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > > /* > * Force WB and ignore guest PAT if the VM does NOT have a non-coherent > - * device attached. Letting the guest control memory types on Intel > - * CPUs may result in unexpected behavior, and so KVM's ABI is to trust > - * the guest to behave only as a last resort. > + * device attached and the CPU doesn't support self-snoop. Letting the > + * guest control memory types on Intel CPUs without self-snoop may > + * result in unexpected behavior, and so KVM's (historical) ABI is to > + * trust the guest to behave only as a last resort. > */ > - if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > + if (!static_cpu_has(X86_FEATURE_SELFSNOOP) && > + !kvm_arch_has_noncoherent_dma(vcpu->kvm)) > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT); Necroposting! Turns out that this change broke "bochs-display" driver in QEMU even when the guest is modern (don't ask me 'who the hell uses bochs for modern guests', it was basically a configuration error :-). E.g: $ qemu-kvm -name c10s -nodefaults -smp 4 -machine q35,smm=on,accel=kvm,kernel-irqchip=split -global driver=cfi.pflash01,property=secure,value=on -cpu host -drive id=drive_image2,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/var/lib/libvirt/images/c10s.qcow2 -device virtio-blk-pci,id=image2,drive=drive_image2,bootindex=3,bus=pcie.0,addr=0x8 -drive file=/usr/share/OVMF/OVMF_CODE.secboot.fd,if=pflash,format=raw,readonly=on,unit=0 -drive file=/tmp/OVMF_VARS.secboot.fd,if=pflash,format=raw,unit=1 -device ahci,id=ahci0 -vnc :0 -device bochs-display -m 8G -monitor stdio The failure looks like Wayland starting and failing right away, this repeats multiple times but after a number of restarts it may try to pretend to work but then it crashes again. Things go back to normal when the commit is reverted in the host kernel. The CPU where this reproduces is fairly modern too (Intel(R) Xeon(R) Silver 4410Y, Sapphire Rapids). I wish I could give additional details to what exactly happens in the guest but I can't find anything useful in the logs ("WARNING: Application 'org.gnome.Shell.desktop' killed by signal 9") and I know too little (nothing?) about how modern Linux graphics stack is organized :-( Cc: Gerd just in case.
> Necroposting! > > Turns out that this change broke "bochs-display" driver in QEMU even > when the guest is modern (don't ask me 'who the hell uses bochs for > modern guests', it was basically a configuration error :-). E.g: qemu stdvga (the default display device) is affected too. bochs-display is effectively stdvga minus the vga compatibility: no text mode, no silly planar video modes. Only linear framebuffers in xrgb format are supported. Both devices are handled by the bochs drm driver. > The CPU where this reproduces is fairly modern too (Intel(R) Xeon(R) > Silver 4410Y, Sapphire Rapids). I wish I could give additional details > to what exactly happens in the guest but I can't find anything useful in > the logs ("WARNING: Application 'org.gnome.Shell.desktop' killed by > signal 9") and I know too little (nothing?) about how modern Linux > graphics stack is organized :-( Cc: Gerd just in case. The device has an (purely virtual) pci memory bar for video memory. The drm driver allocates buffer objects in this video memory. They are either used by the kernel directly (fbcon case, which works fine) or mapped into userspace so the wayland / X11 display server can software-render into the buffer (which breaks). take care, Gerd
Gerd Hoffmann <kraxel@redhat.com> writes: >> Necroposting! >> >> Turns out that this change broke "bochs-display" driver in QEMU even >> when the guest is modern (don't ask me 'who the hell uses bochs for >> modern guests', it was basically a configuration error :-). E.g: > > qemu stdvga (the default display device) is affected too. > So far, I was only able to verify that the issue has nothing to do with OVMF and multi-vcpu, it reproduces very well with $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0 -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -vnc :0 -device VGA -monitor stdio --no-reboot Comparing traces of working and broken cases, I couldn't find anything suspicious but I may had missed something of course. For now, it seems like a userspace misbehavior resulting in a segfault.
On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote: > Gerd Hoffmann <kraxel@redhat.com> writes: > > >> Necroposting! > >> > >> Turns out that this change broke "bochs-display" driver in QEMU even > >> when the guest is modern (don't ask me 'who the hell uses bochs for > >> modern guests', it was basically a configuration error :-). E.g: > > > > qemu stdvga (the default display device) is affected too. > > > > So far, I was only able to verify that the issue has nothing to do with > OVMF and multi-vcpu, it reproduces very well with > > $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s > -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0 > -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 > -vnc :0 -device VGA -monitor stdio --no-reboot > > Comparing traces of working and broken cases, I couldn't find anything > suspicious but I may had missed something of course. For now, it seems > like a userspace misbehavior resulting in a segfault. Guest userspace?
Sean Christopherson <seanjc@google.com> writes: > On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote: >> Gerd Hoffmann <kraxel@redhat.com> writes: >> >> >> Necroposting! >> >> >> >> Turns out that this change broke "bochs-display" driver in QEMU even >> >> when the guest is modern (don't ask me 'who the hell uses bochs for >> >> modern guests', it was basically a configuration error :-). E.g: >> > >> > qemu stdvga (the default display device) is affected too. >> > >> >> So far, I was only able to verify that the issue has nothing to do with >> OVMF and multi-vcpu, it reproduces very well with >> >> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s >> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0 >> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 >> -vnc :0 -device VGA -monitor stdio --no-reboot >> >> Comparing traces of working and broken cases, I couldn't find anything >> suspicious but I may had missed something of course. For now, it seems >> like a userspace misbehavior resulting in a segfault. > > Guest userspace? > Yes? :-) As Gerd described, video memory is "mapped into userspace so the wayland / X11 display server can software-render into the buffer" and it seems that wayland gets something unexpected in this memory and crashes.
Vitaly Kuznetsov <vkuznets@redhat.com> writes: > Sean Christopherson <seanjc@google.com> writes: > >> On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote: >>> Gerd Hoffmann <kraxel@redhat.com> writes: >>> >>> >> Necroposting! >>> >> >>> >> Turns out that this change broke "bochs-display" driver in QEMU even >>> >> when the guest is modern (don't ask me 'who the hell uses bochs for >>> >> modern guests', it was basically a configuration error :-). E.g: >>> > >>> > qemu stdvga (the default display device) is affected too. >>> > >>> >>> So far, I was only able to verify that the issue has nothing to do with >>> OVMF and multi-vcpu, it reproduces very well with >>> >>> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s >>> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0 >>> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 >>> -vnc :0 -device VGA -monitor stdio --no-reboot >>> >>> Comparing traces of working and broken cases, I couldn't find anything >>> suspicious but I may had missed something of course. For now, it seems >>> like a userspace misbehavior resulting in a segfault. >> >> Guest userspace? >> > > Yes? :-) As Gerd described, video memory is "mapped into userspace so > the wayland / X11 display server can software-render into the buffer" > and it seems that wayland gets something unexpected in this memory and > crashes. Also, I don't know if it helps or not, but out of two hunks in 377b2f359d1f, it is the vmx_get_mt_mask() one which brings the issue. I.e. the following is enough to fix things: diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index f18c2d8c7476..733a0c45d1a6 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7659,13 +7659,11 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) /* * Force WB and ignore guest PAT if the VM does NOT have a non-coherent - * device attached and the CPU doesn't support self-snoop. Letting the - * guest control memory types on Intel CPUs without self-snoop may - * result in unexpected behavior, and so KVM's (historical) ABI is to - * trust the guest to behave only as a last resort. + * device attached. Letting the guest control memory types on Intel + * CPUs may result in unexpected behavior, and so KVM's ABI is to trust + * the guest to behave only as a last resort. */ - if (!static_cpu_has(X86_FEATURE_SELFSNOOP) && - !kvm_arch_has_noncoherent_dma(vcpu->kvm)) + if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote: > Vitaly Kuznetsov <vkuznets@redhat.com> writes: > > > Sean Christopherson <seanjc@google.com> writes: > > > >> On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote: > >>> Gerd Hoffmann <kraxel@redhat.com> writes: > >>> > >>> >> Necroposting! > >>> >> > >>> >> Turns out that this change broke "bochs-display" driver in QEMU even > >>> >> when the guest is modern (don't ask me 'who the hell uses bochs for > >>> >> modern guests', it was basically a configuration error :-). E.g: > >>> > > >>> > qemu stdvga (the default display device) is affected too. > >>> > > >>> > >>> So far, I was only able to verify that the issue has nothing to do with > >>> OVMF and multi-vcpu, it reproduces very well with > >>> > >>> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s > >>> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0 > >>> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 > >>> -vnc :0 -device VGA -monitor stdio --no-reboot > >>> > >>> Comparing traces of working and broken cases, I couldn't find anything > >>> suspicious but I may had missed something of course. For now, it seems > >>> like a userspace misbehavior resulting in a segfault. > >> > >> Guest userspace? > >> > > > > Yes? :-) As Gerd described, video memory is "mapped into userspace so > > the wayland / X11 display server can software-render into the buffer" > > and it seems that wayland gets something unexpected in this memory and > > crashes. > > Also, I don't know if it helps or not, but out of two hunks in > 377b2f359d1f, it is the vmx_get_mt_mask() one which brings the > issue. I.e. the following is enough to fix things: > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > index f18c2d8c7476..733a0c45d1a6 100644 > --- a/arch/x86/kvm/vmx/vmx.c > +++ b/arch/x86/kvm/vmx/vmx.c > @@ -7659,13 +7659,11 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > > /* > * Force WB and ignore guest PAT if the VM does NOT have a non-coherent > - * device attached and the CPU doesn't support self-snoop. Letting the > - * guest control memory types on Intel CPUs without self-snoop may > - * result in unexpected behavior, and so KVM's (historical) ABI is to > - * trust the guest to behave only as a last resort. > + * device attached. Letting the guest control memory types on Intel > + * CPUs may result in unexpected behavior, and so KVM's ABI is to trust > + * the guest to behave only as a last resort. > */ > - if (!static_cpu_has(X86_FEATURE_SELFSNOOP) && > - !kvm_arch_has_noncoherent_dma(vcpu->kvm)) > + if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT); Hmm, that suggests the guest kernel maps the buffer as WC. And looking at the bochs driver, IIUC, the kernel mappings via ioremap() are UC-, not WC. So it could be that userspace doesn't play nice with WC, but could it also be that the QEMU backend doesn't play nice with WC (on Intel)? Given that this is a purely synthetic device, is there any reason to use UC or WC? I.e. can the bochs driver configure its VRAM buffers to be WB? It doesn't look super easy (the DRM/TTM code has so. many. layers), but it appears doable. Since the device only exists in VMs, it's possible the bochs driver has never run on Intel CPUs with WC memtype. The one thing that confuses and concerns me is that this broke in the first place. KVM has honored guest PAT on SVM since forever, which is why I/we had decent confidence KVM could honor guest PAT on VMX without breaking anything. SVM (NPT) has an explicitlyed document special "WC+" memtype, where guest=WC && host=WB == WC+, and WC+ accesses snoop caches on all CPUs. But per Intel engineers, Intel CPUs with self-snoop are supposed to snoop caches on all processors too. I assume this same setup works fine on AMD/SVM? If so, we probably need to do more digging before fudging around this in the guest.
On Fri, Aug 30, 2024 at 03:47:11PM +0200, Vitaly Kuznetsov wrote: > Gerd Hoffmann <kraxel@redhat.com> writes: > > >> Necroposting! > >> > >> Turns out that this change broke "bochs-display" driver in QEMU even > >> when the guest is modern (don't ask me 'who the hell uses bochs for > >> modern guests', it was basically a configuration error :-). E.g: > > > > qemu stdvga (the default display device) is affected too. > > > > So far, I was only able to verify that the issue has nothing to do with > OVMF and multi-vcpu, it reproduces very well with > > $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s > -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0 > -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 > -vnc :0 -device VGA -monitor stdio --no-reboot > > Comparing traces of working and broken cases, I couldn't find anything > suspicious but I may had missed something of course. For now, it seems > like a userspace misbehavior resulting in a segfault. Could you please share steps launch the broken guest desktop? (better also with guest kernel version, name of desktop processes, name of X server) Currently, I couldn't reproduce the error with "-device bochs-display" or "-device VGA" locally on a "Coffee Lake-S" test machine. Qemu cmd as below: qemu-system-x86_64 -m 4096 -smp 1 -M q35 -name guest-01 -hda ubuntu22-1.qcow2 -bios /usr/bin/bios.bin -enable-kvm -k en-us -serial stdio -device bochs-display -machine kernel_irqchip=on -cpu host -usb -usbdevice tablet The guest can see a VGA device 00:02.0 Display controller: Device 1234:1111 (rev 02) with driver # readlink /sys/bus/pci/devices/0000\:00\:02.0/driver ../../../bus/pci/drivers/bochs-drm I have tried hardcoding several fields as below: (1) hardcoded the fb_map to wc in the guest driver --- a/drivers/gpu/drm/tiny/bochs.c +++ b/drivers/gpu/drm/tiny/bochs.c @@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev) if (pci_request_region(pdev, 0, "bochs-drm") != 0) DRM_WARN("Cannot request framebuffer, boot fb still active?\n"); - bochs->fb_map = ioremap(addr, size); + bochs->fb_map = ioremap_wc(addr, size); + printk("bochs wc fb_map=%lx, addr=%lx, size=%lx\n", (unsigned long)bochs->fb_map, (unsigned long)addr, (unsigned long)size); if (bochs->fb_map == NULL) { DRM_ERROR("Cannot map framebuffer\n"); return -ENOMEM; With dmesg as below: [ 7.565840] ioremap wc phys_addr fd000000 size 1000000 to wc [ 7.565856] bochs wc fb_map=ffffc90004000000, addr=fd000000, size=1000000 [ 7.565859] [drm] Found bochs VGA, ID 0xb0c5. [ 7.565861] [drm] Framebuffer size 16384 kB @ 0xfd000000, mmio @ 0xfebd9000. [ 7.591995] [drm] Found EDID data blob. [ 7.603956] [drm] Initialized bochs-drm 1.0.0 20130925 for 0000:00:02.0 on minor 1 [ 7.614263] bochs-drm 0000:00:02.0: [drm] fb1: bochs-drmdrmfb frame buffer device (2) hardcoded the memory type to WC in KVM intel driver. + if (gfn >= 0xfd000 && gfn < 0xfe000) + return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; (3) hardcoded mmap flags to WC for some bo objects for Xorg.
> > > Yes? :-) As Gerd described, video memory is "mapped into userspace so > > > the wayland / X11 display server can software-render into the buffer" > > > and it seems that wayland gets something unexpected in this memory and > > > crashes. > > > > Also, I don't know if it helps or not, but out of two hunks in > > 377b2f359d1f, it is the vmx_get_mt_mask() one which brings the > > issue. I.e. the following is enough to fix things: > > > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c > > index f18c2d8c7476..733a0c45d1a6 100644 > > --- a/arch/x86/kvm/vmx/vmx.c > > +++ b/arch/x86/kvm/vmx/vmx.c > > @@ -7659,13 +7659,11 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) > > > > /* > > * Force WB and ignore guest PAT if the VM does NOT have a non-coherent > > - * device attached and the CPU doesn't support self-snoop. Letting the > > - * guest control memory types on Intel CPUs without self-snoop may > > - * result in unexpected behavior, and so KVM's (historical) ABI is to > > - * trust the guest to behave only as a last resort. > > + * device attached. Letting the guest control memory types on Intel > > + * CPUs may result in unexpected behavior, and so KVM's ABI is to trust > > + * the guest to behave only as a last resort. > > */ > > - if (!static_cpu_has(X86_FEATURE_SELFSNOOP) && > > - !kvm_arch_has_noncoherent_dma(vcpu->kvm)) > > + if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > > > return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT); > > Hmm, that suggests the guest kernel maps the buffer as WC. And looking at the > bochs driver, IIUC, the kernel mappings via ioremap() are UC-, not WC. So it > could be that userspace doesn't play nice with WC, but could it also be that the > QEMU backend doesn't play nice with WC (on Intel)? > > Given that this is a purely synthetic device, is there any reason to use UC or WC? Well, sharing code with other (real hardware) drivers is pretty much the only reason. DRM has a set of helper functions to manage vram in pci memory bars (see drm_gem_vram_helper.c, drm_gem_ttm_helper.c). > I.e. can the bochs driver configure its VRAM buffers to be WB? It doesn't look > super easy (the DRM/TTM code has so. many. layers), but it appears doable. Since > the device only exists in VMs, it's possible the bochs driver has never run on > Intel CPUs with WC memtype. Thomas Zimmermann <tzimmermann@suse.de> (Cc'ed) has a drm patch series in flight which switches the bochs driver to a shadow buffer model, i.e. all the buffers visible to fbcon and userspace live in main memory. Display updates are handled via in-kernel memcpy from shadow to vram. The pci memory bar becomes an bochs driver implementation detail not visible outside the driver. This should give the bochs driver the freedom to map vram with whatever attributes work best with kvm, without needing drm changes outside the driver. Of course all this does not help much with current distro kernels broken by this patch ... take care, Gerd
Yan Zhao <yan.y.zhao@intel.com> writes: > On Fri, Aug 30, 2024 at 03:47:11PM +0200, Vitaly Kuznetsov wrote: >> Gerd Hoffmann <kraxel@redhat.com> writes: >> >> >> Necroposting! >> >> >> >> Turns out that this change broke "bochs-display" driver in QEMU even >> >> when the guest is modern (don't ask me 'who the hell uses bochs for >> >> modern guests', it was basically a configuration error :-). E.g: >> > >> > qemu stdvga (the default display device) is affected too. >> > >> >> So far, I was only able to verify that the issue has nothing to do with >> OVMF and multi-vcpu, it reproduces very well with >> >> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s >> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0 >> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 >> -vnc :0 -device VGA -monitor stdio --no-reboot >> >> Comparing traces of working and broken cases, I couldn't find anything >> suspicious but I may had missed something of course. For now, it seems >> like a userspace misbehavior resulting in a segfault. > Could you please share steps launch the broken guest desktop? > (better also with guest kernel version, name of desktop processes, > name of X server) I think the easiest would be to download the latest Centos Stream 10 iso, e.g: https://composes.stream.centos.org/stream-10/development/CentOS-Stream-10-20240902.d.0/compose/BaseOS/x86_64/iso/CentOS-Stream-10-20240902.d.0-x86_64-dvd1.iso (the link is probably not eternal but should work for a couple weeks, check https://composes.stream.centos.org/stream-10/development/ it it doesn't work anymore). Then, just run it: $ /usr/libexec/qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s -cpu host -smp 1 -m 16384 -cdrom CentOS-Stream-10-20240902.d.0-x86_64-dvd1.iso -vnc :0 -device VGA -monitor stdio --no-reboot and connect to VNC console. To speed things up, pick 'Install Centos Stream 10' in the boot menu to avoid ISO integrity check. With "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop" commit included, you will see the following on the VNC console: installer tries starting Wayland, crashes and drops back into text console. If you revert the commit and start over, Wayland will normally start and you will see the installer. If the installer environment is inconvenient for debugging, then you can install in text mode (or with the commit reverted :-) and then the same problem will be observed when gdm starts. FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64) but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R) Silver 4410Y".
On Mon, Sep 02, 2024 at 11:49:43AM +0200, Vitaly Kuznetsov wrote: > Yan Zhao <yan.y.zhao@intel.com> writes: > > > On Fri, Aug 30, 2024 at 03:47:11PM +0200, Vitaly Kuznetsov wrote: > >> Gerd Hoffmann <kraxel@redhat.com> writes: > >> > >> >> Necroposting! > >> >> > >> >> Turns out that this change broke "bochs-display" driver in QEMU even > >> >> when the guest is modern (don't ask me 'who the hell uses bochs for > >> >> modern guests', it was basically a configuration error :-). E.g: > >> > > >> > qemu stdvga (the default display device) is affected too. > >> > > >> > >> So far, I was only able to verify that the issue has nothing to do with > >> OVMF and multi-vcpu, it reproduces very well with > >> > >> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s > >> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0 > >> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 > >> -vnc :0 -device VGA -monitor stdio --no-reboot > >> > >> Comparing traces of working and broken cases, I couldn't find anything > >> suspicious but I may had missed something of course. For now, it seems > >> like a userspace misbehavior resulting in a segfault. > > Could you please share steps launch the broken guest desktop? > > (better also with guest kernel version, name of desktop processes, > > name of X server) > > I think the easiest would be to download the latest Centos Stream 10 > iso, e.g: > > https://composes.stream.centos.org/stream-10/development/CentOS-Stream-10-20240902.d.0/compose/BaseOS/x86_64/iso/CentOS-Stream-10-20240902.d.0-x86_64-dvd1.iso > (the link is probably not eternal but should work for a couple weeks, > check https://composes.stream.centos.org/stream-10/development/ it it > doesn't work anymore). > > Then, just run it: > $ /usr/libexec/qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s -cpu host -smp 1 -m 16384 -cdrom CentOS-Stream-10-20240902.d.0-x86_64-dvd1.iso -vnc :0 -device VGA -monitor stdio --no-reboot > > and connect to VNC console. To speed things up, pick 'Install Centos > Stream 10' in the boot menu to avoid ISO integrity check. > > With "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop" > commit included, you will see the following on the VNC console: > installer tries starting Wayland, crashes and drops back into text > console. If you revert the commit and start over, Wayland will normally > start and you will see the installer. Hmm, looks this issue can't be reproduced on physical machine "Coffee Lake-S". The installer can show up to ask for language selection. But it can be reproduced on the machine "Sapphire Rapids XCC". > If the installer environment is inconvenient for debugging, then you can > install in text mode (or with the commit reverted :-) and then the same > problem will be observed when gdm starts. > Same to the gdm. > FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64) > but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R) > Silver 4410Y". Will have a debug to check what's going wrong. Thanks!
On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote: > FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64) > but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R) > Silver 4410Y". Has this been reproduced on any other hardware besides SPR? I.e. did we stumble on another hardware issue?
Sean Christopherson <seanjc@google.com> writes: > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote: >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64) >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R) >> Silver 4410Y". > > Has this been reproduced on any other hardware besides SPR? I.e. did we stumble > on another hardware issue? Very possible, as according to Yan Zhao this doesn't reproduce on at least "Coffee Lake-S". Let me try to grab some random hardware around and I'll be back with my observations.
On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote: > Sean Christopherson <seanjc@google.com> writes: > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote: > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64) > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R) > >> Silver 4410Y". > > > > Has this been reproduced on any other hardware besides SPR? I.e. did we stumble > > on another hardware issue? > > Very possible, as according to Yan Zhao this doesn't reproduce on at > least "Coffee Lake-S". Let me try to grab some random hardware around > and I'll be back with my observations. Update some new findings from my side: BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range from 0xfd000000 to 0xfe000000. On "Sapphire Rapids XCC": 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch correctly. i.e. if (gfn >= 0xfd000 && gfn < 0xfe000) { return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; } return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch correctly in this case). 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set this fb_map range as WC, with iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size)); However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for this fb_map has been reserved as uc- by ioremap(). Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-. So, with KVM setting WB (no IPAT) to this fb_map range, the effective memory type is UC- and installer/gdm restarts endlessly. 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly. (didn't verify the installer's case as I can't update the driver in that case). The reason is that the ioremap_wc() called during starting GDM will no longer meet conflict and can map guest PAT as WC. WIP to find out why effective UC in fb_map range will make gdm to restart endlessly.
Vitaly Kuznetsov <vkuznets@redhat.com> writes: > Sean Christopherson <seanjc@google.com> writes: > >> On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote: >>> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64) >>> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R) >>> Silver 4410Y". >> >> Has this been reproduced on any other hardware besides SPR? I.e. did we stumble >> on another hardware issue? > > Very possible, as according to Yan Zhao this doesn't reproduce on at > least "Coffee Lake-S". Let me try to grab some random hardware around > and I'll be back with my observations. We have some interesting results :-) In addition to Sapphire Rapids, the issue also reproduces on a Cascade lake CPU (Intel(R) Xeon(R) Silver 4214 CPU) but does NOT reproduce on Skylake (Intel(R) Xeon(R) Gold 5118 CPU). I don't have a lot of desktop CPUs around, so can't say much. AMD also doesn't seem to be affected, at leats AMD EPYC 7413 works fine.
On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote: > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote: > > Sean Christopherson <seanjc@google.com> writes: > > > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote: > > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64) > > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R) > > >> Silver 4410Y". > > > > > > Has this been reproduced on any other hardware besides SPR? I.e. did we stumble > > > on another hardware issue? > > > > Very possible, as according to Yan Zhao this doesn't reproduce on at > > least "Coffee Lake-S". Let me try to grab some random hardware around > > and I'll be back with my observations. > > Update some new findings from my side: > > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range > from 0xfd000000 to 0xfe000000. > > On "Sapphire Rapids XCC": > > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch > correctly. > i.e. > if (gfn >= 0xfd000 && gfn < 0xfe000) { > return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > } > return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; > > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm > restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch > correctly in this case). > > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set > this fb_map range as WC, with > iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size)); > > However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for > this fb_map has been reserved as uc- by ioremap(). > Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-. > > So, with KVM setting WB (no IPAT) to this fb_map range, the effective > memory type is UC- and installer/gdm restarts endlessly. > > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver > to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly. > (didn't verify the installer's case as I can't update the driver in that case). > > The reason is that the ioremap_wc() called during starting GDM will no longer > meet conflict and can map guest PAT as WC. > > > WIP to find out why effective UC in fb_map range will make gdm to restart > endlessly. Not sure whether it's simply because UC is too slow. T=Test execution time of a selftest in which guest writes to a GPA for 0x1000000UL times | Sapphire Rapids XCC | Coffee Lake-S --------------|----------------------|----------------- KVM UC+IPAT | T=0m4.530s | T=0m0.622s --------------|----------------------|----------------- KVM WC+IPAT | T=0m0.149s | T=0m0.176s --------------|----------------------|----------------- KVM WB+IPAT | T=0m0.148s | T=0m0.148s ------------------------------------------------------
On Wed, Sep 04, 2024, Yan Zhao wrote: > On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote: > > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote: > > > Sean Christopherson <seanjc@google.com> writes: > > > > > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote: > > > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64) > > > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R) > > > >> Silver 4410Y". > > > > > > > > Has this been reproduced on any other hardware besides SPR? I.e. did we stumble > > > > on another hardware issue? > > > > > > Very possible, as according to Yan Zhao this doesn't reproduce on at > > > least "Coffee Lake-S". Let me try to grab some random hardware around > > > and I'll be back with my observations. > > > > Update some new findings from my side: > > > > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range > > from 0xfd000000 to 0xfe000000. > > > > On "Sapphire Rapids XCC": > > > > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch > > correctly. > > i.e. > > if (gfn >= 0xfd000 && gfn < 0xfe000) { > > return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > } > > return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; > > > > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm > > restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch > > correctly in this case). > > > > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set > > this fb_map range as WC, with > > iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size)); > > > > However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for > > this fb_map has been reserved as uc- by ioremap(). > > Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-. > > > > So, with KVM setting WB (no IPAT) to this fb_map range, the effective > > memory type is UC- and installer/gdm restarts endlessly. > > > > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver > > to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly. > > (didn't verify the installer's case as I can't update the driver in that case). > > > > The reason is that the ioremap_wc() called during starting GDM will no longer > > meet conflict and can map guest PAT as WC. Huh. The upside of this is that it sounds like there's nothing broken with WC or self-snoop. > > WIP to find out why effective UC in fb_map range will make gdm to restart > > endlessly. > Not sure whether it's simply because UC is too slow. > > T=Test execution time of a selftest in which guest writes to a GPA for > 0x1000000UL times > > | Sapphire Rapids XCC | Coffee Lake-S > --------------|----------------------|----------------- > KVM UC+IPAT | T=0m4.530s | T=0m0.622s Woah. Have you tried testing MOVDIR64 and/or WT? E.g. to see if the problem is with UC specifically, or if it occurs with any accesses that immediately write through to main memory. > --------------|----------------------|----------------- > KVM WC+IPAT | T=0m0.149s | T=0m0.176s > --------------|----------------------|----------------- > KVM WB+IPAT | T=0m0.148s | T=0m0.148s > ------------------------------------------------------
On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote: > On Wed, Sep 04, 2024, Yan Zhao wrote: > > On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote: > > > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote: > > > > Sean Christopherson <seanjc@google.com> writes: > > > > > > > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote: > > > > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64) > > > > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R) > > > > >> Silver 4410Y". > > > > > > > > > > Has this been reproduced on any other hardware besides SPR? I.e. did we stumble > > > > > on another hardware issue? > > > > > > > > Very possible, as according to Yan Zhao this doesn't reproduce on at > > > > least "Coffee Lake-S". Let me try to grab some random hardware around > > > > and I'll be back with my observations. > > > > > > Update some new findings from my side: > > > > > > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range > > > from 0xfd000000 to 0xfe000000. > > > > > > On "Sapphire Rapids XCC": > > > > > > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch > > > correctly. > > > i.e. > > > if (gfn >= 0xfd000 && gfn < 0xfe000) { > > > return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > > } > > > return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; > > > > > > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm > > > restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch > > > correctly in this case). > > > > > > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set > > > this fb_map range as WC, with > > > iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size)); > > > > > > However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for > > > this fb_map has been reserved as uc- by ioremap(). > > > Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-. > > > > > > So, with KVM setting WB (no IPAT) to this fb_map range, the effective > > > memory type is UC- and installer/gdm restarts endlessly. > > > > > > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver > > > to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly. > > > (didn't verify the installer's case as I can't update the driver in that case). > > > > > > The reason is that the ioremap_wc() called during starting GDM will no longer > > > meet conflict and can map guest PAT as WC. > > Huh. The upside of this is that it sounds like there's nothing broken with WC > or self-snoop. Considering a different perspective, the fb_map range is used as frame buffer (vram), with the guest writing to this range and the host reading from it. If the issue were related to self-snooping, we would expect the VNC window to display distorted data. However, the observed behavior is that the GDM window shows up correctly for a sec and restarts over and over. So, do you think we can simply fix this issue by calling ioremap_wc() for the frame buffer/vram range in bochs driver, as is commonly done in other gpu drivers? --- a/drivers/gpu/drm/tiny/bochs.c +++ b/drivers/gpu/drm/tiny/bochs.c @@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev) if (pci_request_region(pdev, 0, "bochs-drm") != 0) DRM_WARN("Cannot request framebuffer, boot fb still active?\n"); - bochs->fb_map = ioremap(addr, size); + bochs->fb_map = ioremap_wc(addr, size); if (bochs->fb_map == NULL) { DRM_ERROR("Cannot map framebuffer\n"); return -ENOMEM; > > > > WIP to find out why effective UC in fb_map range will make gdm to restart > > > endlessly. > > Not sure whether it's simply because UC is too slow. > > > > T=Test execution time of a selftest in which guest writes to a GPA for > > 0x1000000UL times > > > > | Sapphire Rapids XCC | Coffee Lake-S > > --------------|----------------------|----------------- > > KVM UC+IPAT | T=0m4.530s | T=0m0.622s > > Woah. Have you tried testing MOVDIR64 and/or WT? E.g. to see if the problem is > with UC specifically, or if it occurs with any accesses that immediately write > through to main memory. > > > --------------|----------------------|----------------- > > KVM WC+IPAT | T=0m0.149s | T=0m0.176s > > --------------|----------------------|----------------- > > KVM WB+IPAT | T=0m0.148s | T=0m0.148s > > ------------------------------------------------------ I re-run all the tests and collected an averaged data (10 times each) as below (previous data was just a single-run score): T=Test execution time of a selftest in which guest writes to a GPA for 0x1000000UL times with WRITE_ONCE KVM memtype | Sapphire Rapids XCC | Coffee Lake-S -------------|---------------------|---------------- WB+IPAT | T=0.1511s | T=0.1661s -------------|---------------------|---------------- WC+IPAT | T=0.1411s | T=0.1656s -------------|---------------------|---------------- WT+IPAT | T=3.7527s | T=0.6156s -------------|---------------------|---------------- WP+IPAT | T=4.4663s | T=0.6203s -------------|---------------------|---------------- UC+IPAT | T=3.4632s | T=0.5868s T=Test execution time of a selftest in which guest writes to a GPA for 0x1000000UL times with movdir64b. (Coffee Lake-S has no feature movdir64). KVM memtype | Sapphire Rapids XCC | Coffee Lake-S -------------|---------------------|---------------- WB+IPAT | T=2.6142s | / -------------|---------------------|---------------- WC+IPAT | T=2.8919s | / -------------|---------------------|---------------- WT+IPAT | T=3.0966s | / -------------|---------------------|---------------- WP+IPAT | T=2.4933s | / -------------|---------------------|---------------- UC+IPAT | T=3.4606s | /
On Thu, Sep 05, 2024 at 05:43:17PM +0800, Yan Zhao wrote: > On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote: > > On Wed, Sep 04, 2024, Yan Zhao wrote: > > > On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote: > > > > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote: > > > > > Sean Christopherson <seanjc@google.com> writes: > > > > > > > > > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote: > > > > > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64) > > > > > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R) > > > > > >> Silver 4410Y". > > > > > > > > > > > > Has this been reproduced on any other hardware besides SPR? I.e. did we stumble > > > > > > on another hardware issue? > > > > > > > > > > Very possible, as according to Yan Zhao this doesn't reproduce on at > > > > > least "Coffee Lake-S". Let me try to grab some random hardware around > > > > > and I'll be back with my observations. > > > > > > > > Update some new findings from my side: > > > > > > > > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range > > > > from 0xfd000000 to 0xfe000000. > > > > > > > > On "Sapphire Rapids XCC": > > > > > > > > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch > > > > correctly. > > > > i.e. > > > > if (gfn >= 0xfd000 && gfn < 0xfe000) { > > > > return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > > > } > > > > return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; > > > > > > > > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm > > > > restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch > > > > correctly in this case). > > > > > > > > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set > > > > this fb_map range as WC, with > > > > iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size)); > > > > > > > > However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for > > > > this fb_map has been reserved as uc- by ioremap(). > > > > Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-. > > > > > > > > So, with KVM setting WB (no IPAT) to this fb_map range, the effective > > > > memory type is UC- and installer/gdm restarts endlessly. > > > > > > > > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver > > > > to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly. > > > > (didn't verify the installer's case as I can't update the driver in that case). > > > > > > > > The reason is that the ioremap_wc() called during starting GDM will no longer > > > > meet conflict and can map guest PAT as WC. > > > > Huh. The upside of this is that it sounds like there's nothing broken with WC > > or self-snoop. > Considering a different perspective, the fb_map range is used as frame buffer > (vram), with the guest writing to this range and the host reading from it. > If the issue were related to self-snooping, we would expect the VNC window to > display distorted data. However, the observed behavior is that the GDM window > shows up correctly for a sec and restarts over and over. > > So, do you think we can simply fix this issue by calling ioremap_wc() for the > frame buffer/vram range in bochs driver, as is commonly done in other gpu > drivers? > > --- a/drivers/gpu/drm/tiny/bochs.c > +++ b/drivers/gpu/drm/tiny/bochs.c > @@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev) > if (pci_request_region(pdev, 0, "bochs-drm") != 0) > DRM_WARN("Cannot request framebuffer, boot fb still active?\n"); > > - bochs->fb_map = ioremap(addr, size); > + bochs->fb_map = ioremap_wc(addr, size); > if (bochs->fb_map == NULL) { > DRM_ERROR("Cannot map framebuffer\n"); > return -ENOMEM; > > > > > > > > WIP to find out why effective UC in fb_map range will make gdm to restart > > > > endlessly. > > > Not sure whether it's simply because UC is too slow. > > > > > > T=Test execution time of a selftest in which guest writes to a GPA for > > > 0x1000000UL times > > > > > > | Sapphire Rapids XCC | Coffee Lake-S > > > --------------|----------------------|----------------- > > > KVM UC+IPAT | T=0m4.530s | T=0m0.622s > > > > Woah. Have you tried testing MOVDIR64 and/or WT? E.g. to see if the problem is > > with UC specifically, or if it occurs with any accesses that immediately write > > through to main memory. > > > > > --------------|----------------------|----------------- > > > KVM WC+IPAT | T=0m0.149s | T=0m0.176s > > > --------------|----------------------|----------------- > > > KVM WB+IPAT | T=0m0.148s | T=0m0.148s > > > ------------------------------------------------------ > > I re-run all the tests and collected an averaged data (10 times each) as > below (previous data was just a single-run score): > > > T=Test execution time of a selftest in which guest writes to a GPA for > 0x1000000UL times with WRITE_ONCE > > KVM memtype | Sapphire Rapids XCC | Coffee Lake-S > -------------|---------------------|---------------- > WB+IPAT | T=0.1511s | T=0.1661s > -------------|---------------------|---------------- > WC+IPAT | T=0.1411s | T=0.1656s > -------------|---------------------|---------------- > WT+IPAT | T=3.7527s | T=0.6156s > -------------|---------------------|---------------- > WP+IPAT | T=4.4663s | T=0.6203s > -------------|---------------------|---------------- > UC+IPAT | T=3.4632s | T=0.5868s > > > T=Test execution time of a selftest in which guest writes to a GPA for > 0x1000000UL times with movdir64b. > > (Coffee Lake-S has no feature movdir64). > > KVM memtype | Sapphire Rapids XCC | Coffee Lake-S > -------------|---------------------|---------------- > WB+IPAT | T=2.6142s | / > -------------|---------------------|---------------- > WC+IPAT | T=2.8919s | / > -------------|---------------------|---------------- > WT+IPAT | T=3.0966s | / > -------------|---------------------|---------------- > WP+IPAT | T=2.4933s | / > -------------|---------------------|---------------- > UC+IPAT | T=3.4606s | / > Up to now, I think I have root caused this issue. Status before this update: In either ubuntu or centos, on "Sapphire Rapids XCC" - gdm fails to launch gnome-shell when wayland is enabled, when effective memory type is UC/UC-. - gdm is able launch gnome-shell correctly when wayland is enabled, when effective memory type is WB or WC. - gdm is able launch gnome-shell correctly when wayland is not enabled, with any effective memory type. Update: 1. I tried KVM memtype = WT + IPAT for this framebuffer range, gdm fails to launch gnome-shell when wayland is enabled. Since the only difference between WT and WB is that write in WT is slow, the failure should not be self-snoop issue. 2. The current bochs driver calls ioremap() to map framebuffer range. On x86 architectures, ioremap() maps VA with PAT=UC- and invokes memtype_reserve() to reserve the memory type as UC- for the physical range. This reservation can cause subsequent calls to ioremap_wc() to fail to map the VA with PAT=WC to the same framebuffer range in ttm_kmap_iter_linear_io_init(). Consequently, the operation drm_gem_vram_bo_driver_move() become significantly slow on platforms where UC memory access is slow. When host KVM honors guest PAT memory types, the effective memory type for this framebuffer range is - WC when ioremap_wc() is used in driver probing phase - UC- when ioremap() is used. I measured the data below for drm_gem_vram_bo_driver_move() which does memset to this framebuffer range with size 0x3e8000. --------------------------------------------------------------- | | in bochs_hw_init() | | | ioremap() | ioremap_wc() | |-------------------------------|----------------|--------------| | cycles of | 2227.4M | 17.8M | | drm_gem_vram_bo_driver_move() | | | |-------------------------------|----------------|--------------| | time of | 1.24s | 0.01s | | drm_gem_vram_bo_driver_move() | | | --------------------------------------------------------------- drm_gem_vram_bo_driver_move ttm_bo_move_memcpy() ttm_kmap_iter_linear_io_init() iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset,mem->size)); ttm_move_memcpy memset_io or drm_memcpy_from_wc If I comment out the memset_io() and drm_memcpy_from_wc() in ttm_move_memcpy(), drm_gem_vram_bo_driver_move() can be very fast and gdm is able to launch gnome-shell and login successfully, though sometime the screen is a little blurred. 3. I sent a fix at [1] to let guest bochs driver map the framebuffer with PAT=WC for kernel access. [1] https://lore.kernel.org/all/20240909051529.26776-1-yan.y.zhao@intel.com/
On 9/9/24 07:30, Yan Zhao wrote: > On Thu, Sep 05, 2024 at 05:43:17PM +0800, Yan Zhao wrote: >> On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote: >>> On Wed, Sep 04, 2024, Yan Zhao wrote: >>>> On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote: >>>>> On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote: >>>>>> Sean Christopherson <seanjc@google.com> writes: >>>>>> >>>>>>> On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote: >>>>>>>> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64) >>>>>>>> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R) >>>>>>>> Silver 4410Y". >>>>>>> >>>>>>> Has this been reproduced on any other hardware besides SPR? I.e. did we stumble >>>>>>> on another hardware issue? >>>>>> >>>>>> Very possible, as according to Yan Zhao this doesn't reproduce on at >>>>>> least "Coffee Lake-S". Let me try to grab some random hardware around >>>>>> and I'll be back with my observations. >>>>> >>>>> Update some new findings from my side: >>>>> >>>>> BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range >>>>> from 0xfd000000 to 0xfe000000. >>>>> >>>>> On "Sapphire Rapids XCC": >>>>> >>>>> 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch >>>>> correctly. >>>>> i.e. >>>>> if (gfn >= 0xfd000 && gfn < 0xfe000) { >>>>> return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; >>>>> } >>>>> return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; >>>>> >>>>> 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm >>>>> restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch >>>>> correctly in this case). >>>>> >>>>> 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set >>>>> this fb_map range as WC, with >>>>> iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size)); >>>>> >>>>> However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for >>>>> this fb_map has been reserved as uc- by ioremap(). >>>>> Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-. >>>>> >>>>> So, with KVM setting WB (no IPAT) to this fb_map range, the effective >>>>> memory type is UC- and installer/gdm restarts endlessly. >>>>> >>>>> 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver >>>>> to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly. >>>>> (didn't verify the installer's case as I can't update the driver in that case). >>>>> >>>>> The reason is that the ioremap_wc() called during starting GDM will no longer >>>>> meet conflict and can map guest PAT as WC. >>> >>> Huh. The upside of this is that it sounds like there's nothing broken with WC >>> or self-snoop. >> Considering a different perspective, the fb_map range is used as frame buffer >> (vram), with the guest writing to this range and the host reading from it. >> If the issue were related to self-snooping, we would expect the VNC window to >> display distorted data. However, the observed behavior is that the GDM window >> shows up correctly for a sec and restarts over and over. >> >> So, do you think we can simply fix this issue by calling ioremap_wc() for the >> frame buffer/vram range in bochs driver, as is commonly done in other gpu >> drivers? >> >> --- a/drivers/gpu/drm/tiny/bochs.c >> +++ b/drivers/gpu/drm/tiny/bochs.c >> @@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev) >> if (pci_request_region(pdev, 0, "bochs-drm") != 0) >> DRM_WARN("Cannot request framebuffer, boot fb still active?\n"); >> >> - bochs->fb_map = ioremap(addr, size); >> + bochs->fb_map = ioremap_wc(addr, size); >> if (bochs->fb_map == NULL) { >> DRM_ERROR("Cannot map framebuffer\n"); >> return -ENOMEM; While this is a fix for future kernels, it doesn't change the result for VMs already in existence. I don't think there's an alternative to putting this behind a quirk. Paolo
On Mon, Sep 09, 2024, Paolo Bonzini wrote: > On 9/9/24 07:30, Yan Zhao wrote: > > On Thu, Sep 05, 2024 at 05:43:17PM +0800, Yan Zhao wrote: > > > On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote: > > > > On Wed, Sep 04, 2024, Yan Zhao wrote: > > > > > On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote: > > > > > > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote: > > > > > > > Sean Christopherson <seanjc@google.com> writes: > > > > > > > > > > > > > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote: > > > > > > > > > FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64) > > > > > > > > > but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R) > > > > > > > > > Silver 4410Y". > > > > > > > > > > > > > > > > Has this been reproduced on any other hardware besides SPR? I.e. did we stumble > > > > > > > > on another hardware issue? > > > > > > > > > > > > > > Very possible, as according to Yan Zhao this doesn't reproduce on at > > > > > > > least "Coffee Lake-S". Let me try to grab some random hardware around > > > > > > > and I'll be back with my observations. > > > > > > > > > > > > Update some new findings from my side: > > > > > > > > > > > > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range > > > > > > from 0xfd000000 to 0xfe000000. > > > > > > > > > > > > On "Sapphire Rapids XCC": > > > > > > > > > > > > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch > > > > > > correctly. > > > > > > i.e. > > > > > > if (gfn >= 0xfd000 && gfn < 0xfe000) { > > > > > > return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; > > > > > > } > > > > > > return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT; > > > > > > > > > > > > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm > > > > > > restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch > > > > > > correctly in this case). > > > > > > > > > > > > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set > > > > > > this fb_map range as WC, with > > > > > > iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size)); > > > > > > > > > > > > However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for > > > > > > this fb_map has been reserved as uc- by ioremap(). > > > > > > Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-. > > > > > > > > > > > > So, with KVM setting WB (no IPAT) to this fb_map range, the effective > > > > > > memory type is UC- and installer/gdm restarts endlessly. > > > > > > > > > > > > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver > > > > > > to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly. > > > > > > (didn't verify the installer's case as I can't update the driver in that case). > > > > > > > > > > > > The reason is that the ioremap_wc() called during starting GDM will no longer > > > > > > meet conflict and can map guest PAT as WC. > > > > > > > > Huh. The upside of this is that it sounds like there's nothing broken with WC > > > > or self-snoop. > > > Considering a different perspective, the fb_map range is used as frame buffer > > > (vram), with the guest writing to this range and the host reading from it. > > > If the issue were related to self-snooping, we would expect the VNC window to > > > display distorted data. However, the observed behavior is that the GDM window > > > shows up correctly for a sec and restarts over and over. > > > > > > So, do you think we can simply fix this issue by calling ioremap_wc() for the > > > frame buffer/vram range in bochs driver, as is commonly done in other gpu > > > drivers? > > > > > > --- a/drivers/gpu/drm/tiny/bochs.c > > > +++ b/drivers/gpu/drm/tiny/bochs.c > > > @@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev) > > > if (pci_request_region(pdev, 0, "bochs-drm") != 0) > > > DRM_WARN("Cannot request framebuffer, boot fb still active?\n"); > > > > > > - bochs->fb_map = ioremap(addr, size); > > > + bochs->fb_map = ioremap_wc(addr, size); > > > if (bochs->fb_map == NULL) { > > > DRM_ERROR("Cannot map framebuffer\n"); > > > return -ENOMEM; > > While this is a fix for future kernels, it doesn't change the result for VMs > already in existence. I would prefer to bottom out on exactly whether or not the SPR/CLX behavior is working as intended. Maybe the ~8x slowdown is just a side effect of any Intel multi-socket/node system, but I think we should get confirmation (inasmuch as possible) that that is indeed the case. E.g. if this is actually a bug in CLX+, then the actions we need to take are different. > I don't think there's an alternative to putting this behind a quirk. This gets a bit weird, which is why I want to bottom out on whether or not CLX and SPR are working as intended. If non-coherent DMA is attached to the VM, then even before this patch KVM would honor guest PAT. I agree that we don't want to break existing setups, but if CLX+SPR are working as intended, then this is inarguably a bochs driver bug, and I would prefer to have the quirk explicitly reference bochs-compatible devices, e.g. in the name and documentation, so that userspace can disable the quirk by default and only leave it enabled if a bochs device is being exposed to the guest.
On Mon, Sep 09, 2024 at 03:24:40PM +0200, Paolo Bonzini wrote: > While this is a fix for future kernels, it doesn't change the result for VMs > already in existence. Though this is the truth, I have concerns that there may be other guest drivers with improper PAT configurations that were previously masked by KVM's force-WB setting. Now that we respect the guest's PAT settings, these misconfigurations could lead to degraded performance, potentially perceived as errors, as was observed in the previous VMX unit test and the current Bochs scenario. > I don't think there's an alternative to putting this behind a quirk.
On 30.08.24 11:35, Vitaly Kuznetsov wrote: > Sean Christopherson <seanjc@google.com> writes: > >> Unconditionally honor guest PAT on CPUs that support self-snoop, as >> Intel has confirmed that CPUs that support self-snoop always snoop caches >> and store buffers. I.e. CPUs with self-snoop maintain cache coherency >> even in the presence of aliased memtypes, thus there is no need to trust >> the guest behaves and only honor PAT as a last resort, as KVM does today. >> >> Honoring guest PAT is desirable for use cases where the guest has access >> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual >> (mediated, for all intents and purposes) GPU is exposed to the guest, along >> with buffers that are consumed directly by the physical GPU, i.e. which >> can't be proxied by the host to ensure writes from the guest are performed >> with the correct memory type for the GPU. > > Necroposting! > > Turns out that this change broke "bochs-display" driver in QEMU even > when the guest is modern (don't ask me 'who the hell uses bochs for > modern guests', it was basically a configuration error :-). E.g: > [...] This regression made it to the list of tracked regressions. It seems this thread stalled a while ago. Was this ever fixed? Does not look like it, but I might have missed something. Or is this a regression I should just ignore for one reason or another? Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page. #regzbot poke
"Linux regression tracking (Thorsten Leemhuis)" <regressions@leemhuis.info> writes: > On 30.08.24 11:35, Vitaly Kuznetsov wrote: >> Sean Christopherson <seanjc@google.com> writes: >> >>> Unconditionally honor guest PAT on CPUs that support self-snoop, as >>> Intel has confirmed that CPUs that support self-snoop always snoop caches >>> and store buffers. I.e. CPUs with self-snoop maintain cache coherency >>> even in the presence of aliased memtypes, thus there is no need to trust >>> the guest behaves and only honor PAT as a last resort, as KVM does today. >>> >>> Honoring guest PAT is desirable for use cases where the guest has access >>> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual >>> (mediated, for all intents and purposes) GPU is exposed to the guest, along >>> with buffers that are consumed directly by the physical GPU, i.e. which >>> can't be proxied by the host to ensure writes from the guest are performed >>> with the correct memory type for the GPU. >> >> Necroposting! >> >> Turns out that this change broke "bochs-display" driver in QEMU even >> when the guest is modern (don't ask me 'who the hell uses bochs for >> modern guests', it was basically a configuration error :-). E.g: >> [...] > > This regression made it to the list of tracked regressions. It seems > this thread stalled a while ago. Was this ever fixed? Does not look like > it, but I might have missed something. Or is this a regression I should > just ignore for one reason or another? > The regression was addressed in by reverting 377b2f359d1f in 6.11 commit 9d70f3fec14421e793ffbc0ec2f739b24e534900 Author: Paolo Bonzini <pbonzini@redhat.com> Date: Sun Sep 15 02:49:33 2024 -0400 Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop" Also, there's a (pending) DRM patch fixing it from the guest's side: https://gitlab.freedesktop.org/drm/misc/kernel/-/commit/9388ccf69925223223c87355a417ba39b13a5e8e
On 07.10.24 15:38, Vitaly Kuznetsov wrote: > "Linux regression tracking (Thorsten Leemhuis)" > <regressions@leemhuis.info> writes: > >> On 30.08.24 11:35, Vitaly Kuznetsov wrote: >>> Sean Christopherson <seanjc@google.com> writes: >>> >>>> Unconditionally honor guest PAT on CPUs that support self-snoop, as >>>> Intel has confirmed that CPUs that support self-snoop always snoop caches >>>> and store buffers. I.e. CPUs with self-snoop maintain cache coherency >>>> even in the presence of aliased memtypes, thus there is no need to trust >>>> the guest behaves and only honor PAT as a last resort, as KVM does today. >>>> >>>> Honoring guest PAT is desirable for use cases where the guest has access >>>> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual >>>> (mediated, for all intents and purposes) GPU is exposed to the guest, along >>>> with buffers that are consumed directly by the physical GPU, i.e. which >>>> can't be proxied by the host to ensure writes from the guest are performed >>>> with the correct memory type for the GPU. >>> >>> Necroposting! >>> >>> Turns out that this change broke "bochs-display" driver in QEMU even >>> when the guest is modern (don't ask me 'who the hell uses bochs for >>> modern guests', it was basically a configuration error :-). E.g: >>> [...] >> >> This regression made it to the list of tracked regressions. It seems >> this thread stalled a while ago. Was this ever fixed? Does not look like >> it, but I might have missed something. Or is this a regression I should >> just ignore for one reason or another? >> > > The regression was addressed in by reverting 377b2f359d1f in 6.11 > > commit 9d70f3fec14421e793ffbc0ec2f739b24e534900 > Author: Paolo Bonzini <pbonzini@redhat.com> > Date: Sun Sep 15 02:49:33 2024 -0400 > > Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop" Thx. Sorry, missed that, thx for pointing me towards it. I had looked for things like that, but seems I messed up my lore query. Apologies for the noise! > Also, there's a (pending) DRM patch fixing it from the guest's side: > https://gitlab.freedesktop.org/drm/misc/kernel/-/commit/9388ccf69925223223c87355a417ba39b13a5e8e Great! Ciao, Thorsten P.S.: #regzbot fix: 9d70f3fec14421e793ffbc0ec2f739b24e534900
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 403cd8f914cd..7fa514830628 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu, bool kvm_mmu_may_ignore_guest_pat(void) { /* - * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM + * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does + * not support self-snoop (or is affected by an erratum), and the VM * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to * honor the memtype from the guest's PAT so that guest accesses to * memory that is DMA'd aren't cached against the guest's wishes. As a * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA, - * KVM _always_ ignores guest PAT (when EPT is enabled). + * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE + * bits in response to non-coherent device (un)registration. */ - return shadow_memtype_mask; + return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask; } int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 17a8e4fdf9c4..5dc4c24ae203 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) /* * Force WB and ignore guest PAT if the VM does NOT have a non-coherent - * device attached. Letting the guest control memory types on Intel - * CPUs may result in unexpected behavior, and so KVM's ABI is to trust - * the guest to behave only as a last resort. + * device attached and the CPU doesn't support self-snoop. Letting the + * guest control memory types on Intel CPUs without self-snoop may + * result in unexpected behavior, and so KVM's (historical) ABI is to + * trust the guest to behave only as a last resort. */ - if (!kvm_arch_has_noncoherent_dma(vcpu->kvm)) + if (!static_cpu_has(X86_FEATURE_SELFSNOOP) && + !kvm_arch_has_noncoherent_dma(vcpu->kvm)) return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT; return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
Unconditionally honor guest PAT on CPUs that support self-snoop, as Intel has confirmed that CPUs that support self-snoop always snoop caches and store buffers. I.e. CPUs with self-snoop maintain cache coherency even in the presence of aliased memtypes, thus there is no need to trust the guest behaves and only honor PAT as a last resort, as KVM does today. Honoring guest PAT is desirable for use cases where the guest has access to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual (mediated, for all intents and purposes) GPU is exposed to the guest, along with buffers that are consumed directly by the physical GPU, i.e. which can't be proxied by the host to ensure writes from the guest are performed with the correct memory type for the GPU. Cc: Yiwei Zhang <zzyiwei@google.com> Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Suggested-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> --- arch/x86/kvm/mmu/mmu.c | 8 +++++--- arch/x86/kvm/vmx/vmx.c | 10 ++++++---- 2 files changed, 11 insertions(+), 7 deletions(-)