diff mbox series

[5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop

Message ID 20240309010929.1403984-6-seanjc@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: VMX: Drop MTRR virtualization, honor guest PAT | expand

Commit Message

Sean Christopherson March 9, 2024, 1:09 a.m. UTC
Unconditionally honor guest PAT on CPUs that support self-snoop, as
Intel has confirmed that CPUs that support self-snoop always snoop caches
and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
even in the presence of aliased memtypes, thus there is no need to trust
the guest behaves and only honor PAT as a last resort, as KVM does today.

Honoring guest PAT is desirable for use cases where the guest has access
to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
(mediated, for all intents and purposes) GPU is exposed to the guest, along
with buffers that are consumed directly by the physical GPU, i.e. which
can't be proxied by the host to ensure writes from the guest are performed
with the correct memory type for the GPU.

Cc: Yiwei Zhang <zzyiwei@google.com>
Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c |  8 +++++---
 arch/x86/kvm/vmx/vmx.c | 10 ++++++----
 2 files changed, 11 insertions(+), 7 deletions(-)

Comments

Yan Zhao March 11, 2024, 1:16 a.m. UTC | #1
On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
> Unconditionally honor guest PAT on CPUs that support self-snoop, as
> Intel has confirmed that CPUs that support self-snoop always snoop caches
> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
> even in the presence of aliased memtypes, thus there is no need to trust
> the guest behaves and only honor PAT as a last resort, as KVM does today.
> 
> Honoring guest PAT is desirable for use cases where the guest has access
> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
> (mediated, for all intents and purposes) GPU is exposed to the guest, along
> with buffers that are consumed directly by the physical GPU, i.e. which
> can't be proxied by the host to ensure writes from the guest are performed
> with the correct memory type for the GPU.
> 
> Cc: Yiwei Zhang <zzyiwei@google.com>
> Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
> Suggested-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c |  8 +++++---
>  arch/x86/kvm/vmx/vmx.c | 10 ++++++----
>  2 files changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 403cd8f914cd..7fa514830628 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>  bool kvm_mmu_may_ignore_guest_pat(void)
>  {
>  	/*
> -	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
> +	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
> +	 * not support self-snoop (or is affected by an erratum), and the VM
>  	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
>  	 * honor the memtype from the guest's PAT so that guest accesses to
>  	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
>  	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
> -	 * KVM _always_ ignores guest PAT (when EPT is enabled).
> +	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
> +	 * bits in response to non-coherent device (un)registration.
>  	 */
> -	return shadow_memtype_mask;
> +	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
>  }
>  
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 17a8e4fdf9c4..5dc4c24ae203 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  
>  	/*
>  	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> -	 * device attached.  Letting the guest control memory types on Intel
> -	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> -	 * the guest to behave only as a last resort.
> +	 * device attached and the CPU doesn't support self-snoop.  Letting the
> +	 * guest control memory types on Intel CPUs without self-snoop may
> +	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> +	 * trust the guest to behave only as a last resort.
>  	 */
> -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> +	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> +	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
>  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn
about unsafe before honoring guest memory type.
Though it's a KVM's historical ABI, it's not safe in the security perspective
because page aliasing without proper cache flush handling on CPUs without
self-snoop may open a door for guest to read uninitialized host data.
e.g. when there's a noncoherent DMA device attached, and if there's a memory
region that is not pinned in vfio/iommufd side, (e.g. memory region in vfio's
skipped section), then though the guest memory from this memory region is not
accessible to noncoherent DMAs, vCPUs can still access this part of guest memory.
Then if vCPUs use WC as guest type, it may bypass host's initialization data in
cache and read stale data in host, causing information leak.

My preference is still to force WB
(i.e. (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT) in case of
!static_cpu_has(X86_FEATURE_SELFSNOOP) && kvm_arch_has_noncoherent_dma(vcpu->kvm).
Firstly, it's because there're few CPUs with features VMX without self-snoop;
Secondly, security takes priority over functionality :)

>  
>  	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
> -- 
> 2.44.0.278.ge034bb2e1d-goog
>
Sean Christopherson March 12, 2024, 12:25 a.m. UTC | #2
On Mon, Mar 11, 2024, Yan Zhao wrote:
> On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 17a8e4fdf9c4..5dc4c24ae203 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >  
> >  	/*
> >  	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> > -	 * device attached.  Letting the guest control memory types on Intel
> > -	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> > -	 * the guest to behave only as a last resort.
> > +	 * device attached and the CPU doesn't support self-snoop.  Letting the
> > +	 * guest control memory types on Intel CPUs without self-snoop may
> > +	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> > +	 * trust the guest to behave only as a last resort.
> >  	 */
> > -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > +	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > +	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> >  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
> For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn
> about unsafe before honoring guest memory type.

I don't think it gains us enough to offset the potential pain such a message
would bring.  Assuming the warning isn't outright ignored, the most likely scenario
is that the warning will cause random end users to worry that the setup they've
been running for years is broken, when in reality it's probably just fine for their
use case.

I would be quite surprised if there are people running untrusted workloads on
10+ year old silicon *and* have passthrough devices and non-coherent IOMMUs/DMA.
And anyone exposing a device directly to an untrusted workload really should have
done their homework.

And it's not like we're going to change KVM's historical behavior at this point.

> Though it's a KVM's historical ABI, it's not safe in the security perspective
> because page aliasing without proper cache flush handling on CPUs without
> self-snoop may open a door for guest to read uninitialized host data.
> e.g. when there's a noncoherent DMA device attached, and if there's a memory
> region that is not pinned in vfio/iommufd side, (e.g. memory region in vfio's
> skipped section), then though the guest memory from this memory region is not
> accessible to noncoherent DMAs, vCPUs can still access this part of guest memory.
> Then if vCPUs use WC as guest type, it may bypass host's initialization data in
> cache and read stale data in host, causing information leak.
> 
> My preference is still to force WB
> (i.e. (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT) in case of
> !static_cpu_has(X86_FEATURE_SELFSNOOP) && kvm_arch_has_noncoherent_dma(vcpu->kvm).
> Firstly, it's because there're few CPUs with features VMX without self-snoop;

This is unfortunately not true.  I don't know the details, but apparently all
Intel CPUs before Ivy Bridge had a one or more related errata.

/*
 * Processors which have self-snooping capability can handle conflicting
 * memory type across CPUs by snooping its own cache. However, there exists
 * CPU models in which having conflicting memory types still leads to
 * unpredictable behavior, machine check errors, or hangs. Clear this
 * feature to prevent its use on machines with known erratas.
 */
static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
{
	switch (c->x86_model) {
	case INTEL_FAM6_CORE_YONAH:
	case INTEL_FAM6_CORE2_MEROM:
	case INTEL_FAM6_CORE2_MEROM_L:
	case INTEL_FAM6_CORE2_PENRYN:
	case INTEL_FAM6_CORE2_DUNNINGTON:
	case INTEL_FAM6_NEHALEM:
	case INTEL_FAM6_NEHALEM_G:
	case INTEL_FAM6_NEHALEM_EP:
	case INTEL_FAM6_NEHALEM_EX:
	case INTEL_FAM6_WESTMERE:
	case INTEL_FAM6_WESTMERE_EP:
	case INTEL_FAM6_SANDYBRIDGE:
		setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
	}
}

> Secondly, security takes priority over functionality :)

Yeah, but not breaking userspace for setups that have existed for 10+ years takes
priority over all of that :-)
Tian, Kevin March 12, 2024, 7:30 a.m. UTC | #3
> From: Sean Christopherson <seanjc@google.com>
> Sent: Tuesday, March 12, 2024 8:26 AM
> 
> On Mon, Mar 11, 2024, Yan Zhao wrote:
> > On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 17a8e4fdf9c4..5dc4c24ae203 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu
> *vcpu, gfn_t gfn, bool is_mmio)
> > >
> > >  	/*
> > >  	 * Force WB and ignore guest PAT if the VM does NOT have a non-
> coherent
> > > -	 * device attached.  Letting the guest control memory types on Intel
> > > -	 * CPUs may result in unexpected behavior, and so KVM's ABI is to
> trust
> > > -	 * the guest to behave only as a last resort.
> > > +	 * device attached and the CPU doesn't support self-snoop.  Letting
> the
> > > +	 * guest control memory types on Intel CPUs without self-snoop may
> > > +	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> > > +	 * trust the guest to behave only as a last resort.
> > >  	 */
> > > -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > > +	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > > +	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > >  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT)
> | VMX_EPT_IPAT_BIT;
> >
> > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should
> warn
> > about unsafe before honoring guest memory type.
> 
> I don't think it gains us enough to offset the potential pain such a message
> would bring.  Assuming the warning isn't outright ignored, the most likely
> scenario
> is that the warning will cause random end users to worry that the setup
> they've
> been running for years is broken, when in reality it's probably just fine for
> their
> use case.

Isn't the 'worry' necessary to allow end users evaluate whether "it's
probably just fine for their use case"?

I saw the old comment already mentioned that doing so may lead to
unexpected behaviors. But I'm not sure whether such code-level
caveat has been visible enough to end users.

> 
> I would be quite surprised if there are people running untrusted workloads
> on
> 10+ year old silicon *and* have passthrough devices and non-coherent
> IOMMUs/DMA.

this is probably true.

> And anyone exposing a device directly to an untrusted workload really
> should have
> done their homework.

or they run trusted workloads which might be tampered by virus to
exceed the scope of their homework. 
Sean Christopherson March 12, 2024, 4:07 p.m. UTC | #4
On Tue, Mar 12, 2024, Kevin Tian wrote:
> > From: Sean Christopherson <seanjc@google.com>
> > Sent: Tuesday, March 12, 2024 8:26 AM
> > 
> > On Mon, Mar 11, 2024, Yan Zhao wrote:
> > > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn
> > > about unsafe before honoring guest memory type.
> > 
> > I don't think it gains us enough to offset the potential pain such a
> > message would bring.  Assuming the warning isn't outright ignored, the most
> > likely scenario is that the warning will cause random end users to worry
> > that the setup they've been running for years is broken, when in reality
> > it's probably just fine for their
> > use case.
> 
> Isn't the 'worry' necessary to allow end users evaluate whether "it's
> probably just fine for their use case"?

Realistically, outside of large scale deployments, no end user is going to be able
to make that evaluation, because practically speaking it requires someone with
quite low-level hardware knowledge to be able to make that judgment call.  And
counting by number of human end users (as opposed to number of VMs being run), I
am willing to bet that the overwhelming majority of KVM users aren't kernel or
systems engineers.

Understandably, users tend to be alarmed by (or suspicious of) new warnings that
show up.  E.g. see the ancient KVM_SET_TSS_ADDR pr_warn[*].  And recently, we had
an internal bug report filed against KVM because they observed a performance
regression when booting a KVM guest, and saw a new message about some CPU
vulnerability being mitigated on VM-Exit that showed up in their *guest* kernel.

In short, my concern is that adding a new pr_warn() will generate noise for end
users *and* for KVM developers/maintainers, because even if we phrase the message
to talk specifically about "untrusted workloads", the majority of affected users
will not have the necessary knowledge to make an informed decision.

[*] https://lore.kernel.org/all/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956@tarent.de

> I saw the old comment already mentioned that doing so may lead to unexpected
> behaviors. But I'm not sure whether such code-level caveat has been visible
> enough to end users.

Another point to consider: KVM is _always_ potentially broken on such CPUs, as
KVM forces WB for guest accesses.  I.e. KVM will create memory aliasing if the
host has guest memory mapped as non-WB in the PAT, without non-coherent DMA
exposed to the guest.

> > I would be quite surprised if there are people running untrusted workloads
> > on 10+ year old silicon *and* have passthrough devices and non-coherent
> > IOMMUs/DMA.
> 
> this is probably true.
> 
> > And anyone exposing a device directly to an untrusted workload really
> > should have done their homework.
> 
> or they run trusted workloads which might be tampered by virus to
> exceed the scope of their homework. 
Yan Zhao March 13, 2024, 1:18 a.m. UTC | #5
On Tue, Mar 12, 2024 at 09:07:11AM -0700, Sean Christopherson wrote:
> On Tue, Mar 12, 2024, Kevin Tian wrote:
> > > From: Sean Christopherson <seanjc@google.com>
> > > Sent: Tuesday, March 12, 2024 8:26 AM
> > > 
> > > On Mon, Mar 11, 2024, Yan Zhao wrote:
> > > > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > > > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn
> > > > about unsafe before honoring guest memory type.
> > > 
> > > I don't think it gains us enough to offset the potential pain such a
> > > message would bring.  Assuming the warning isn't outright ignored, the most
> > > likely scenario is that the warning will cause random end users to worry
> > > that the setup they've been running for years is broken, when in reality
> > > it's probably just fine for their
> > > use case.
> > 
> > Isn't the 'worry' necessary to allow end users evaluate whether "it's
> > probably just fine for their use case"?
> 
> Realistically, outside of large scale deployments, no end user is going to be able
> to make that evaluation, because practically speaking it requires someone with
> quite low-level hardware knowledge to be able to make that judgment call.  And
> counting by number of human end users (as opposed to number of VMs being run), I
> am willing to bet that the overwhelming majority of KVM users aren't kernel or
> systems engineers.
> 
> Understandably, users tend to be alarmed by (or suspicious of) new warnings that
> show up.  E.g. see the ancient KVM_SET_TSS_ADDR pr_warn[*].  And recently, we had
> an internal bug report filed against KVM because they observed a performance
> regression when booting a KVM guest, and saw a new message about some CPU
> vulnerability being mitigated on VM-Exit that showed up in their *guest* kernel.
> 
> In short, my concern is that adding a new pr_warn() will generate noise for end
> users *and* for KVM developers/maintainers, because even if we phrase the message
> to talk specifically about "untrusted workloads", the majority of affected users
> will not have the necessary knowledge to make an informed decision.
> 
> [*] https://lore.kernel.org/all/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956@tarent.de
> 
> > I saw the old comment already mentioned that doing so may lead to unexpected
> > behaviors. But I'm not sure whether such code-level caveat has been visible
> > enough to end users.
>
What about add a new module parameter to turn on honoring guest for
non-coherent DMAs on CPUs without self-snoop?
A previous example is VFIO's "allow_unsafe_interrupts" parameter.

> Another point to consider: KVM is _always_ potentially broken on such CPUs, as
> KVM forces WB for guest accesses.  I.e. KVM will create memory aliasing if the
> host has guest memory mapped as non-WB in the PAT, without non-coherent DMA
> exposed to the guest.
In this case, memory aliasing may only lead to guest not function well, since
guest is not using WC/UC (which can bypass host initialization data in cache).
But if guest has any chance to read information not intended to it, I believe
we need to fix it as well.


> > > I would be quite surprised if there are people running untrusted workloads
> > > on 10+ year old silicon *and* have passthrough devices and non-coherent
> > > IOMMUs/DMA.
What if the guest is a totally malicious one?
Previously we trust the guest in the case of noncoherent DMA is because
we believe a malicious guest will only meet data corruption and shoot his own
foot.

But as Jason raised the security problem in another mail thread [1],
this will expose security hole if CPUs have no self-snoop. So, we need
to fix it, right?
+ Jason, in case I didn't understand this problem correctly.

[1] https://lore.kernel.org/all/20240108153818.GK50406@nvidia.com/

> > this is probably true.
> > 
> > > And anyone exposing a device directly to an untrusted workload really
> > > should have done their homework.
> > 
> > or they run trusted workloads which might be tampered by virus to
> > exceed the scope of their homework. 
Tian, Kevin March 13, 2024, 8:52 a.m. UTC | #6
> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Wednesday, March 13, 2024 9:19 AM
> 
> On Tue, Mar 12, 2024 at 09:07:11AM -0700, Sean Christopherson wrote:
> > On Tue, Mar 12, 2024, Kevin Tian wrote:
> > > I saw the old comment already mentioned that doing so may lead to
> unexpected
> > > behaviors. But I'm not sure whether such code-level caveat has been
> visible
> > > enough to end users.
> >
> What about add a new module parameter to turn on honoring guest for
> non-coherent DMAs on CPUs without self-snoop?
> A previous example is VFIO's "allow_unsafe_interrupts" parameter.

Not sure whether such parameter has a real value.

If it's default 'off' then you break those 10yr+ setups and it's unacceptable.

If it's default 'on' then same effect as this patch does then I'm not sure who'd
want to turn it off afterwards. Somebody aware of such limitation can simply
avoid assigning device w/ non-coherent DMA in VM config file instead of 
further going to toggle the module parameter to prevent something which
he already knows not to do.

> 
> > Another point to consider: KVM is _always_ potentially broken on such
> CPUs, as
> > KVM forces WB for guest accesses.  I.e. KVM will create memory aliasing if
> the
> > host has guest memory mapped as non-WB in the PAT, without non-
> coherent DMA
> > exposed to the guest.
> In this case, memory aliasing may only lead to guest not function well, since
> guest is not using WC/UC (which can bypass host initialization data in cache).
> But if guest has any chance to read information not intended to it, I believe
> we need to fix it as well.

Having cache/memory inconsistent could hurt both guest and host.

So in concept forcing WB instead of following host attribute on such CPUs
is kind of broken, though in reality we may not see an usage of exposing
non-WB memory to guest on those old setups as discussed for virtio-gpu case.

> 
> 
> > > > I would be quite surprised if there are people running untrusted
> workloads
> > > > on 10+ year old silicon *and* have passthrough devices and non-
> coherent
> > > > IOMMUs/DMA.
> What if the guest is a totally malicious one?
> Previously we trust the guest in the case of noncoherent DMA is because
> we believe a malicious guest will only meet data corruption and shoot his
> own
> foot.
> 
> But as Jason raised the security problem in another mail thread [1],
> this will expose security hole if CPUs have no self-snoop. So, we need
> to fix it, right?
> + Jason, in case I didn't understand this problem correctly.
> 
> [1] https://lore.kernel.org/all/20240108153818.GK50406@nvidia.com/

We'll certain fix the security hole on CPUs w/ self-snoop. In this case
CPU accesses are guaranteed to be coherent and the vulnerability can
only be exposed via non-coherent DMA which is supposed to be fixed
by your coming series. 

But for old CPUs w/o self-snoop the hole can be exploited using either CPU
or non-coherent DMA once the guest PAT is honored. As long as nobody
is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
by Sean that sustaining the old behavior is probably the best option...
Yan Zhao March 13, 2024, 8:55 a.m. UTC | #7
> We'll certain fix the security hole on CPUs w/ self-snoop. In this case
> CPU accesses are guaranteed to be coherent and the vulnerability can
> only be exposed via non-coherent DMA which is supposed to be fixed
> by your coming series. 
> 
> But for old CPUs w/o self-snoop the hole can be exploited using either CPU
> or non-coherent DMA once the guest PAT is honored. As long as nobody
> is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
We can cook a patch to check CPU self-snoop and force WB in EPT even for
non-coherent DMA if no self-snoop. Then back porting such a patch together
with the IOMMU side mitigation for non-coherent DMA.

Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of
no self-snoop.

> by Sean that sustaining the old behavior is probably the best option...
Yes, as long as we think exposing secuirty hole on those platforms is acceptable.
Sean Christopherson March 13, 2024, 3:09 p.m. UTC | #8
On Wed, Mar 13, 2024, Yan Zhao wrote:
> > We'll certain fix the security hole on CPUs w/ self-snoop. In this case
> > CPU accesses are guaranteed to be coherent and the vulnerability can
> > only be exposed via non-coherent DMA which is supposed to be fixed
> > by your coming series. 
> > 
> > But for old CPUs w/o self-snoop the hole can be exploited using either CPU
> > or non-coherent DMA once the guest PAT is honored. As long as nobody
> > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
> We can cook a patch to check CPU self-snoop and force WB in EPT even for
> non-coherent DMA if no self-snoop. Then back porting such a patch together
> with the IOMMU side mitigation for non-coherent DMA.

Please don't.  This is a "let sleeping dogs lie" situation.

  let sleeping dogs lie - avoid interfering in a situation that is currently
  causing no problems but might do so as a result of such interference.

Yes, there is technically a flaw, but we have zero evidence that anyone cares or
that it is actually problematic in practice.  On the other hand, any functional
change we make has a non-zero changes of breaking existing setups that have worked
for many years. 

> Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of
> no self-snoop.
> 
> > by Sean that sustaining the old behavior is probably the best option...
> Yes, as long as we think exposing secuirty hole on those platforms is acceptable. 

Yes, I think it's acceptable.  Obviously not ideal, but given the alternatives,
I think it is a reasonable risk.

Being 100% secure is simply not possible.  Security is often about balancing the
risk/threat against the cost.  In this case, the risk is low (old hardware,
uncommon setup for untrusted guests, small window of opportunity, and limited
data exposure), whereas the cost is high (decent chance of breaking existing VMs).
Yan Zhao March 14, 2024, 12:12 a.m. UTC | #9
On Wed, Mar 13, 2024 at 08:09:28AM -0700, Sean Christopherson wrote:
> On Wed, Mar 13, 2024, Yan Zhao wrote:
> > > We'll certain fix the security hole on CPUs w/ self-snoop. In this case
> > > CPU accesses are guaranteed to be coherent and the vulnerability can
> > > only be exposed via non-coherent DMA which is supposed to be fixed
> > > by your coming series. 
> > > 
> > > But for old CPUs w/o self-snoop the hole can be exploited using either CPU
> > > or non-coherent DMA once the guest PAT is honored. As long as nobody
> > > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
> > We can cook a patch to check CPU self-snoop and force WB in EPT even for
> > non-coherent DMA if no self-snoop. Then back porting such a patch together
> > with the IOMMU side mitigation for non-coherent DMA.
> 
> Please don't.  This is a "let sleeping dogs lie" situation.
> 
>   let sleeping dogs lie - avoid interfering in a situation that is currently
>   causing no problems but might do so as a result of such interference.
> 
> Yes, there is technically a flaw, but we have zero evidence that anyone cares or
> that it is actually problematic in practice.  On the other hand, any functional
> change we make has a non-zero changes of breaking existing setups that have worked
> for many years. 
> 
> > Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of
> > no self-snoop.
> > 
> > > by Sean that sustaining the old behavior is probably the best option...
> > Yes, as long as we think exposing secuirty hole on those platforms is acceptable. 
> 
> Yes, I think it's acceptable.  Obviously not ideal, but given the alternatives,
> I think it is a reasonable risk.
> 
> Being 100% secure is simply not possible.  Security is often about balancing the
> risk/threat against the cost.  In this case, the risk is low (old hardware,
> uncommon setup for untrusted guests, small window of opportunity, and limited
> data exposure), whereas the cost is high (decent chance of breaking existing VMs).
Ok, thanks for explanation!
I still have one last question: if in future there are CPUs with no selfsnoop
(for some unknown reason, or just paranoid), do we allow this unsafe honoring of
guest memory type for non-coherent DMAs?
Sean Christopherson March 14, 2024, 1 a.m. UTC | #10
On Thu, Mar 14, 2024, Yan Zhao wrote:
> On Wed, Mar 13, 2024 at 08:09:28AM -0700, Sean Christopherson wrote:
> > On Wed, Mar 13, 2024, Yan Zhao wrote:
> > > > We'll certain fix the security hole on CPUs w/ self-snoop. In this case
> > > > CPU accesses are guaranteed to be coherent and the vulnerability can
> > > > only be exposed via non-coherent DMA which is supposed to be fixed
> > > > by your coming series. 
> > > > 
> > > > But for old CPUs w/o self-snoop the hole can be exploited using either CPU
> > > > or non-coherent DMA once the guest PAT is honored. As long as nobody
> > > > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
> > > We can cook a patch to check CPU self-snoop and force WB in EPT even for
> > > non-coherent DMA if no self-snoop. Then back porting such a patch together
> > > with the IOMMU side mitigation for non-coherent DMA.
> > 
> > Please don't.  This is a "let sleeping dogs lie" situation.
> > 
> >   let sleeping dogs lie - avoid interfering in a situation that is currently
> >   causing no problems but might do so as a result of such interference.
> > 
> > Yes, there is technically a flaw, but we have zero evidence that anyone cares or
> > that it is actually problematic in practice.  On the other hand, any functional
> > change we make has a non-zero changes of breaking existing setups that have worked
> > for many years. 
> > 
> > > Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of
> > > no self-snoop.
> > > 
> > > > by Sean that sustaining the old behavior is probably the best option...
> > > Yes, as long as we think exposing secuirty hole on those platforms is acceptable. 
> > 
> > Yes, I think it's acceptable.  Obviously not ideal, but given the alternatives,
> > I think it is a reasonable risk.
> > 
> > Being 100% secure is simply not possible.  Security is often about balancing the
> > risk/threat against the cost.  In this case, the risk is low (old hardware,
> > uncommon setup for untrusted guests, small window of opportunity, and limited
> > data exposure), whereas the cost is high (decent chance of breaking existing VMs).
> Ok, thanks for explanation!
> I still have one last question: if in future there are CPUs with no selfsnoop
> (for some unknown reason, or just paranoid),

Don't jinx us :-)

> do we allow this unsafe honoring of guest memory type for non-coherent DMAs? 

Hmm, no?  AIUI, the intent is that all future CPUs will support self-snoop.
Assuming that's the case, then I think it's fair to put the onus on Intel to not
have escapes, i.e. to not end up shipping CPUs with erratas.  Then, if an erratum
does come along, we can make a decision, e.g. allow older CPUs by default, but
require an admin opt-in for new CPUs.

But let's just hope that's a problem we never have to deal with.  :-D
Chao Gao March 25, 2024, 3:43 a.m. UTC | #11
On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
>Unconditionally honor guest PAT on CPUs that support self-snoop, as
>Intel has confirmed that CPUs that support self-snoop always snoop caches
>and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
>even in the presence of aliased memtypes, thus there is no need to trust
>the guest behaves and only honor PAT as a last resort, as KVM does today.
>
>Honoring guest PAT is desirable for use cases where the guest has access
>to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
>(mediated, for all intents and purposes) GPU is exposed to the guest, along
>with buffers that are consumed directly by the physical GPU, i.e. which
>can't be proxied by the host to ensure writes from the guest are performed
>with the correct memory type for the GPU.
>
>Cc: Yiwei Zhang <zzyiwei@google.com>
>Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
>Suggested-by: Kevin Tian <kevin.tian@intel.com>
>Signed-off-by: Sean Christopherson <seanjc@google.com>
>---
> arch/x86/kvm/mmu/mmu.c |  8 +++++---
> arch/x86/kvm/vmx/vmx.c | 10 ++++++----
> 2 files changed, 11 insertions(+), 7 deletions(-)
>
>diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>index 403cd8f914cd..7fa514830628 100644
>--- a/arch/x86/kvm/mmu/mmu.c
>+++ b/arch/x86/kvm/mmu/mmu.c
>@@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> bool kvm_mmu_may_ignore_guest_pat(void)
> {
> 	/*
>-	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
>+	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
>+	 * not support self-snoop (or is affected by an erratum), and the VM
> 	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
> 	 * honor the memtype from the guest's PAT so that guest accesses to
> 	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
> 	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
>-	 * KVM _always_ ignores guest PAT (when EPT is enabled).
>+	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
>+	 * bits in response to non-coherent device (un)registration.
> 	 */
>-	return shadow_memtype_mask;
>+	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
> }
> 
> int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 17a8e4fdf9c4..5dc4c24ae203 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> 
> 	/*
> 	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
>-	 * device attached.  Letting the guest control memory types on Intel
>-	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
>-	 * the guest to behave only as a last resort.
>+	 * device attached and the CPU doesn't support self-snoop.  Letting the
>+	 * guest control memory types on Intel CPUs without self-snoop may
>+	 * result in unexpected behavior, and so KVM's (historical) ABI is to
>+	 * trust the guest to behave only as a last resort.
> 	 */
>-	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
>+	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

W/ this change, guests w/o pass-thru devices can also access UC memory. Locking
UC memory leads to bus lock. So, guests w/o pass-thru devices can potentially
launch DOS attacks on other CPUs on host. isn't it a problem?
Sean Christopherson April 1, 2024, 10:29 p.m. UTC | #12
On Mon, Mar 25, 2024, Chao Gao wrote:
> On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
> >Unconditionally honor guest PAT on CPUs that support self-snoop, as
> >Intel has confirmed that CPUs that support self-snoop always snoop caches
> >and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
> >even in the presence of aliased memtypes, thus there is no need to trust
> >the guest behaves and only honor PAT as a last resort, as KVM does today.
> >
> >Honoring guest PAT is desirable for use cases where the guest has access
> >to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
> >(mediated, for all intents and purposes) GPU is exposed to the guest, along
> >with buffers that are consumed directly by the physical GPU, i.e. which
> >can't be proxied by the host to ensure writes from the guest are performed
> >with the correct memory type for the GPU.

...

> > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >index 17a8e4fdf9c4..5dc4c24ae203 100644
> >--- a/arch/x86/kvm/vmx/vmx.c
> >+++ b/arch/x86/kvm/vmx/vmx.c
> >@@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > 
> > 	/*
> > 	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> >-	 * device attached.  Letting the guest control memory types on Intel
> >-	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> >-	 * the guest to behave only as a last resort.
> >+	 * device attached and the CPU doesn't support self-snoop.  Letting the
> >+	 * guest control memory types on Intel CPUs without self-snoop may
> >+	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> >+	 * trust the guest to behave only as a last resort.
> > 	 */
> >-	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> >+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> >+	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
> W/ this change, guests w/o pass-thru devices can also access UC memory. Locking
> UC memory leads to bus lock. So, guests w/o pass-thru devices can potentially
> launch DOS attacks on other CPUs on host. isn't it a problem?

Guests can already trigger bus locks with atomic accesses that split cache lines.
And SPR adds bus lock detection.  So practically speaking, I'm pretty sure ICX is
the only CPU where anything close to a novel attack is possible.  And FWIW, such
an attack is already possible on AMD.
Vitaly Kuznetsov Aug. 30, 2024, 9:35 a.m. UTC | #13
Sean Christopherson <seanjc@google.com> writes:

> Unconditionally honor guest PAT on CPUs that support self-snoop, as
> Intel has confirmed that CPUs that support self-snoop always snoop caches
> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
> even in the presence of aliased memtypes, thus there is no need to trust
> the guest behaves and only honor PAT as a last resort, as KVM does today.
>
> Honoring guest PAT is desirable for use cases where the guest has access
> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
> (mediated, for all intents and purposes) GPU is exposed to the guest, along
> with buffers that are consumed directly by the physical GPU, i.e. which
> can't be proxied by the host to ensure writes from the guest are performed
> with the correct memory type for the GPU.
>
> Cc: Yiwei Zhang <zzyiwei@google.com>
> Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
> Suggested-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c |  8 +++++---
>  arch/x86/kvm/vmx/vmx.c | 10 ++++++----
>  2 files changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 403cd8f914cd..7fa514830628 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>  bool kvm_mmu_may_ignore_guest_pat(void)
>  {
>  	/*
> -	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
> +	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
> +	 * not support self-snoop (or is affected by an erratum), and the VM
>  	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
>  	 * honor the memtype from the guest's PAT so that guest accesses to
>  	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
>  	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
> -	 * KVM _always_ ignores guest PAT (when EPT is enabled).
> +	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
> +	 * bits in response to non-coherent device (un)registration.
>  	 */
> -	return shadow_memtype_mask;
> +	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
>  }
>  
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 17a8e4fdf9c4..5dc4c24ae203 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  
>  	/*
>  	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> -	 * device attached.  Letting the guest control memory types on Intel
> -	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> -	 * the guest to behave only as a last resort.
> +	 * device attached and the CPU doesn't support self-snoop.  Letting the
> +	 * guest control memory types on Intel CPUs without self-snoop may
> +	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> +	 * trust the guest to behave only as a last resort.
>  	 */
> -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> +	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> +	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
>  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>  
>  	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);

Necroposting!

Turns out that this change broke "bochs-display" driver in QEMU even
when the guest is modern (don't ask me 'who the hell uses bochs for
modern guests', it was basically a configuration error :-). E.g:

$ qemu-kvm -name c10s -nodefaults -smp 4 -machine
q35,smm=on,accel=kvm,kernel-irqchip=split -global
driver=cfi.pflash01,property=secure,value=on -cpu host -drive
id=drive_image2,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/var/lib/libvirt/images/c10s.qcow2
-device virtio-blk-pci,id=image2,drive=drive_image2,bootindex=3,bus=pcie.0,addr=0x8
-drive file=/usr/share/OVMF/OVMF_CODE.secboot.fd,if=pflash,format=raw,readonly=on,unit=0
-drive file=/tmp/OVMF_VARS.secboot.fd,if=pflash,format=raw,unit=1
-device ahci,id=ahci0 -vnc :0 -device bochs-display -m 8G -monitor stdio

The failure looks like Wayland starting and failing right away, this
repeats multiple times but after a number of restarts it may try to
pretend to work but then it crashes again. Things go back to normal when
the commit is reverted in the host kernel.

The CPU where this reproduces is fairly modern too (Intel(R) Xeon(R)
Silver 4410Y, Sapphire Rapids). I wish I could give additional details
to what exactly happens in the guest but I can't find anything useful in
the logs ("WARNING: Application 'org.gnome.Shell.desktop' killed by
signal 9") and I know too little (nothing?) about how modern Linux
graphics stack is organized :-( Cc: Gerd just in case.
Gerd Hoffmann Aug. 30, 2024, 11:05 a.m. UTC | #14
> Necroposting!
> 
> Turns out that this change broke "bochs-display" driver in QEMU even
> when the guest is modern (don't ask me 'who the hell uses bochs for
> modern guests', it was basically a configuration error :-). E.g:

qemu stdvga (the default display device) is affected too.

bochs-display is effectively stdvga minus the vga compatibility: no text
mode, no silly planar video modes.  Only linear framebuffers in xrgb
format are supported.

Both devices are handled by the bochs drm driver.

> The CPU where this reproduces is fairly modern too (Intel(R) Xeon(R)
> Silver 4410Y, Sapphire Rapids). I wish I could give additional details
> to what exactly happens in the guest but I can't find anything useful in
> the logs ("WARNING: Application 'org.gnome.Shell.desktop' killed by
> signal 9") and I know too little (nothing?) about how modern Linux
> graphics stack is organized :-( Cc: Gerd just in case.

The device has an (purely virtual) pci memory bar for video memory.
The drm driver allocates buffer objects in this video memory.  They
are either used by the kernel directly (fbcon case, which works fine)
or mapped into userspace so the wayland / X11 display server can
software-render into the buffer (which breaks).

take care,
  Gerd
Vitaly Kuznetsov Aug. 30, 2024, 1:47 p.m. UTC | #15
Gerd Hoffmann <kraxel@redhat.com> writes:

>> Necroposting!
>> 
>> Turns out that this change broke "bochs-display" driver in QEMU even
>> when the guest is modern (don't ask me 'who the hell uses bochs for
>> modern guests', it was basically a configuration error :-). E.g:
>
> qemu stdvga (the default display device) is affected too.
>

So far, I was only able to verify that the issue has nothing to do with
OVMF and multi-vcpu, it reproduces very well with

$ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
-cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
-device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
-vnc :0 -device VGA -monitor stdio --no-reboot

Comparing traces of working and broken cases, I couldn't find anything
suspicious but I may had missed something of course. For now, it seems
like a userspace misbehavior resulting in a segfault.
Sean Christopherson Aug. 30, 2024, 1:52 p.m. UTC | #16
On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote:
> Gerd Hoffmann <kraxel@redhat.com> writes:
> 
> >> Necroposting!
> >> 
> >> Turns out that this change broke "bochs-display" driver in QEMU even
> >> when the guest is modern (don't ask me 'who the hell uses bochs for
> >> modern guests', it was basically a configuration error :-). E.g:
> >
> > qemu stdvga (the default display device) is affected too.
> >
> 
> So far, I was only able to verify that the issue has nothing to do with
> OVMF and multi-vcpu, it reproduces very well with
> 
> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
> -vnc :0 -device VGA -monitor stdio --no-reboot
> 
> Comparing traces of working and broken cases, I couldn't find anything
> suspicious but I may had missed something of course. For now, it seems
> like a userspace misbehavior resulting in a segfault.

Guest userspace?
Vitaly Kuznetsov Aug. 30, 2024, 2:06 p.m. UTC | #17
Sean Christopherson <seanjc@google.com> writes:

> On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote:
>> Gerd Hoffmann <kraxel@redhat.com> writes:
>> 
>> >> Necroposting!
>> >> 
>> >> Turns out that this change broke "bochs-display" driver in QEMU even
>> >> when the guest is modern (don't ask me 'who the hell uses bochs for
>> >> modern guests', it was basically a configuration error :-). E.g:
>> >
>> > qemu stdvga (the default display device) is affected too.
>> >
>> 
>> So far, I was only able to verify that the issue has nothing to do with
>> OVMF and multi-vcpu, it reproduces very well with
>> 
>> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
>> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
>> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
>> -vnc :0 -device VGA -monitor stdio --no-reboot
>> 
>> Comparing traces of working and broken cases, I couldn't find anything
>> suspicious but I may had missed something of course. For now, it seems
>> like a userspace misbehavior resulting in a segfault.
>
> Guest userspace?
>

Yes? :-) As Gerd described, video memory is "mapped into userspace so
the wayland / X11 display server can software-render into the buffer"
and it seems that wayland gets something unexpected in this memory and
crashes.
Vitaly Kuznetsov Aug. 30, 2024, 2:37 p.m. UTC | #18
Vitaly Kuznetsov <vkuznets@redhat.com> writes:

> Sean Christopherson <seanjc@google.com> writes:
>
>> On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote:
>>> Gerd Hoffmann <kraxel@redhat.com> writes:
>>> 
>>> >> Necroposting!
>>> >> 
>>> >> Turns out that this change broke "bochs-display" driver in QEMU even
>>> >> when the guest is modern (don't ask me 'who the hell uses bochs for
>>> >> modern guests', it was basically a configuration error :-). E.g:
>>> >
>>> > qemu stdvga (the default display device) is affected too.
>>> >
>>> 
>>> So far, I was only able to verify that the issue has nothing to do with
>>> OVMF and multi-vcpu, it reproduces very well with
>>> 
>>> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
>>> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
>>> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
>>> -vnc :0 -device VGA -monitor stdio --no-reboot
>>> 
>>> Comparing traces of working and broken cases, I couldn't find anything
>>> suspicious but I may had missed something of course. For now, it seems
>>> like a userspace misbehavior resulting in a segfault.
>>
>> Guest userspace?
>>
>
> Yes? :-) As Gerd described, video memory is "mapped into userspace so
> the wayland / X11 display server can software-render into the buffer"
> and it seems that wayland gets something unexpected in this memory and
> crashes. 

Also, I don't know if it helps or not, but out of two hunks in
377b2f359d1f, it is the vmx_get_mt_mask() one which brings the
issue. I.e. the following is enough to fix things:

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f18c2d8c7476..733a0c45d1a6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7659,13 +7659,11 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 
        /*
         * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
-        * device attached and the CPU doesn't support self-snoop.  Letting the
-        * guest control memory types on Intel CPUs without self-snoop may
-        * result in unexpected behavior, and so KVM's (historical) ABI is to
-        * trust the guest to behave only as a last resort.
+        * device attached.  Letting the guest control memory types on Intel
+        * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
+        * the guest to behave only as a last resort.
         */
-       if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
-           !kvm_arch_has_noncoherent_dma(vcpu->kvm))
+       if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
        return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
Sean Christopherson Aug. 30, 2024, 4:13 p.m. UTC | #19
On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote:
> Vitaly Kuznetsov <vkuznets@redhat.com> writes:
> 
> > Sean Christopherson <seanjc@google.com> writes:
> >
> >> On Fri, Aug 30, 2024, Vitaly Kuznetsov wrote:
> >>> Gerd Hoffmann <kraxel@redhat.com> writes:
> >>> 
> >>> >> Necroposting!
> >>> >> 
> >>> >> Turns out that this change broke "bochs-display" driver in QEMU even
> >>> >> when the guest is modern (don't ask me 'who the hell uses bochs for
> >>> >> modern guests', it was basically a configuration error :-). E.g:
> >>> >
> >>> > qemu stdvga (the default display device) is affected too.
> >>> >
> >>> 
> >>> So far, I was only able to verify that the issue has nothing to do with
> >>> OVMF and multi-vcpu, it reproduces very well with
> >>> 
> >>> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
> >>> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
> >>> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
> >>> -vnc :0 -device VGA -monitor stdio --no-reboot
> >>> 
> >>> Comparing traces of working and broken cases, I couldn't find anything
> >>> suspicious but I may had missed something of course. For now, it seems
> >>> like a userspace misbehavior resulting in a segfault.
> >>
> >> Guest userspace?
> >>
> >
> > Yes? :-) As Gerd described, video memory is "mapped into userspace so
> > the wayland / X11 display server can software-render into the buffer"
> > and it seems that wayland gets something unexpected in this memory and
> > crashes. 
> 
> Also, I don't know if it helps or not, but out of two hunks in
> 377b2f359d1f, it is the vmx_get_mt_mask() one which brings the
> issue. I.e. the following is enough to fix things:
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f18c2d8c7476..733a0c45d1a6 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7659,13 +7659,11 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  
>         /*
>          * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> -        * device attached and the CPU doesn't support self-snoop.  Letting the
> -        * guest control memory types on Intel CPUs without self-snoop may
> -        * result in unexpected behavior, and so KVM's (historical) ABI is to
> -        * trust the guest to behave only as a last resort.
> +        * device attached.  Letting the guest control memory types on Intel
> +        * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> +        * the guest to behave only as a last resort.
>          */
> -       if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> -           !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> +       if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>  
>         return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);

Hmm, that suggests the guest kernel maps the buffer as WC.  And looking at the
bochs driver, IIUC, the kernel mappings via ioremap() are UC-, not WC.  So it
could be that userspace doesn't play nice with WC, but could it also be that the
QEMU backend doesn't play nice with WC (on Intel)?

Given that this is a purely synthetic device, is there any reason to use UC or WC?
I.e. can the bochs driver configure its VRAM buffers to be WB?  It doesn't look
super easy (the DRM/TTM code has so. many. layers), but it appears doable.  Since
the device only exists in VMs, it's possible the bochs driver has never run on
Intel CPUs with WC memtype.

The one thing that confuses and concerns me is that this broke in the first place.
KVM has honored guest PAT on SVM since forever, which is why I/we had decent
confidence KVM could honor guest PAT on VMX without breaking anything.  SVM (NPT)
has an explicitlyed document special "WC+" memtype, where guest=WC && host=WB == WC+,
and WC+ accesses snoop caches on all CPUs.

But per Intel engineers, Intel CPUs with self-snoop are supposed to snoop caches
on all processors too.

I assume this same setup works fine on AMD/SVM?  If so, we probably need to do
more digging before fudging around this in the guest.
Yan Zhao Sept. 2, 2024, 1:44 a.m. UTC | #20
On Fri, Aug 30, 2024 at 03:47:11PM +0200, Vitaly Kuznetsov wrote:
> Gerd Hoffmann <kraxel@redhat.com> writes:
> 
> >> Necroposting!
> >> 
> >> Turns out that this change broke "bochs-display" driver in QEMU even
> >> when the guest is modern (don't ask me 'who the hell uses bochs for
> >> modern guests', it was basically a configuration error :-). E.g:
> >
> > qemu stdvga (the default display device) is affected too.
> >
> 
> So far, I was only able to verify that the issue has nothing to do with
> OVMF and multi-vcpu, it reproduces very well with
> 
> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
> -vnc :0 -device VGA -monitor stdio --no-reboot
> 
> Comparing traces of working and broken cases, I couldn't find anything
> suspicious but I may had missed something of course. For now, it seems
> like a userspace misbehavior resulting in a segfault.
Could you please share steps launch the broken guest desktop?
(better also with guest kernel version, name of desktop processes,
 name of X server)

Currently, I couldn't reproduce the error with "-device bochs-display" or
"-device VGA" locally on a "Coffee Lake-S" test machine. 

Qemu cmd as below:
qemu-system-x86_64 -m 4096 -smp 1 -M q35 -name guest-01
-hda ubuntu22-1.qcow2 -bios /usr/bin/bios.bin -enable-kvm -k en-us
-serial stdio -device bochs-display -machine kernel_irqchip=on
-cpu host -usb -usbdevice tablet

The guest can see a VGA device
    00:02.0 Display controller: Device 1234:1111 (rev 02)
with driver
    # readlink /sys/bus/pci/devices/0000\:00\:02.0/driver
    ../../../bus/pci/drivers/bochs-drm

I have tried hardcoding several fields as below:

(1)  hardcoded the fb_map to wc in the guest driver

--- a/drivers/gpu/drm/tiny/bochs.c
+++ b/drivers/gpu/drm/tiny/bochs.c
@@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev)
        if (pci_request_region(pdev, 0, "bochs-drm") != 0)
                DRM_WARN("Cannot request framebuffer, boot fb still active?\n");

-       bochs->fb_map = ioremap(addr, size);
+       bochs->fb_map = ioremap_wc(addr, size);
+       printk("bochs wc fb_map=%lx, addr=%lx, size=%lx\n", (unsigned long)bochs->fb_map, (unsigned long)addr, (unsigned long)size);
        if (bochs->fb_map == NULL) {
                DRM_ERROR("Cannot map framebuffer\n");
                return -ENOMEM;

With dmesg as below:

[    7.565840] ioremap wc phys_addr fd000000 size 1000000 to wc
[    7.565856] bochs wc fb_map=ffffc90004000000, addr=fd000000, size=1000000
[    7.565859] [drm] Found bochs VGA, ID 0xb0c5.
[    7.565861] [drm] Framebuffer size 16384 kB @ 0xfd000000, mmio @ 0xfebd9000.
[    7.591995] [drm] Found EDID data blob.
[    7.603956] [drm] Initialized bochs-drm 1.0.0 20130925 for 0000:00:02.0 on minor 1
[    7.614263] bochs-drm 0000:00:02.0: [drm] fb1: bochs-drmdrmfb frame buffer device

(2) hardcoded the memory type to WC in KVM intel driver.
+       if (gfn >= 0xfd000 && gfn < 0xfe000)
+               return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;


(3) hardcoded mmap flags to WC for some bo objects for Xorg.
Gerd Hoffmann Sept. 2, 2024, 8:23 a.m. UTC | #21
> > > Yes? :-) As Gerd described, video memory is "mapped into userspace so
> > > the wayland / X11 display server can software-render into the buffer"
> > > and it seems that wayland gets something unexpected in this memory and
> > > crashes. 
> > 
> > Also, I don't know if it helps or not, but out of two hunks in
> > 377b2f359d1f, it is the vmx_get_mt_mask() one which brings the
> > issue. I.e. the following is enough to fix things:
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index f18c2d8c7476..733a0c45d1a6 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7659,13 +7659,11 @@ u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >  
> >         /*
> >          * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> > -        * device attached and the CPU doesn't support self-snoop.  Letting the
> > -        * guest control memory types on Intel CPUs without self-snoop may
> > -        * result in unexpected behavior, and so KVM's (historical) ABI is to
> > -        * trust the guest to behave only as a last resort.
> > +        * device attached.  Letting the guest control memory types on Intel
> > +        * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> > +        * the guest to behave only as a last resort.
> >          */
> > -       if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > -           !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > +       if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> >                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> >  
> >         return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
> 
> Hmm, that suggests the guest kernel maps the buffer as WC.  And looking at the
> bochs driver, IIUC, the kernel mappings via ioremap() are UC-, not WC.  So it
> could be that userspace doesn't play nice with WC, but could it also be that the
> QEMU backend doesn't play nice with WC (on Intel)?
> 
> Given that this is a purely synthetic device, is there any reason to use UC or WC?

Well, sharing code with other (real hardware) drivers is pretty much the
only reason.  DRM has a set of helper functions to manage vram in pci
memory bars (see drm_gem_vram_helper.c, drm_gem_ttm_helper.c).

> I.e. can the bochs driver configure its VRAM buffers to be WB?  It doesn't look
> super easy (the DRM/TTM code has so. many. layers), but it appears doable.  Since
> the device only exists in VMs, it's possible the bochs driver has never run on
> Intel CPUs with WC memtype.

Thomas Zimmermann <tzimmermann@suse.de> (Cc'ed) has a drm patch series
in flight which switches the bochs driver to a shadow buffer model, i.e.
all the buffers visible to fbcon and userspace live in main memory.
Display updates are handled via in-kernel memcpy from shadow to vram.
The pci memory bar becomes an bochs driver implementation detail not
visible outside the driver.  This should give the bochs driver the
freedom to map vram with whatever attributes work best with kvm, without
needing drm changes outside the driver.

Of course all this does not help much with current distro kernels broken
by this patch ...

take care,
  Gerd
Vitaly Kuznetsov Sept. 2, 2024, 9:49 a.m. UTC | #22
Yan Zhao <yan.y.zhao@intel.com> writes:

> On Fri, Aug 30, 2024 at 03:47:11PM +0200, Vitaly Kuznetsov wrote:
>> Gerd Hoffmann <kraxel@redhat.com> writes:
>> 
>> >> Necroposting!
>> >> 
>> >> Turns out that this change broke "bochs-display" driver in QEMU even
>> >> when the guest is modern (don't ask me 'who the hell uses bochs for
>> >> modern guests', it was basically a configuration error :-). E.g:
>> >
>> > qemu stdvga (the default display device) is affected too.
>> >
>> 
>> So far, I was only able to verify that the issue has nothing to do with
>> OVMF and multi-vcpu, it reproduces very well with
>> 
>> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
>> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
>> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
>> -vnc :0 -device VGA -monitor stdio --no-reboot
>> 
>> Comparing traces of working and broken cases, I couldn't find anything
>> suspicious but I may had missed something of course. For now, it seems
>> like a userspace misbehavior resulting in a segfault.
> Could you please share steps launch the broken guest desktop?
> (better also with guest kernel version, name of desktop processes,
>  name of X server)

I think the easiest would be to download the latest Centos Stream 10
iso, e.g:

https://composes.stream.centos.org/stream-10/development/CentOS-Stream-10-20240902.d.0/compose/BaseOS/x86_64/iso/CentOS-Stream-10-20240902.d.0-x86_64-dvd1.iso
(the link is probably not eternal but should work for a couple weeks,
check https://composes.stream.centos.org/stream-10/development/ it it
doesn't work anymore).

Then, just run it:
$ /usr/libexec/qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s -cpu host -smp 1 -m 16384 -cdrom CentOS-Stream-10-20240902.d.0-x86_64-dvd1.iso -vnc :0 -device VGA -monitor stdio --no-reboot

and connect to VNC console. To speed things up, pick 'Install Centos
Stream 10' in the boot menu to avoid ISO integrity check.

With "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
commit included, you will see the following on the VNC console:
installer tries starting Wayland, crashes and drops back into text
console. If you revert the commit and start over, Wayland will normally
start and you will see the installer.

If the installer environment is inconvenient for debugging, then you can
install in text mode (or with the commit reverted :-) and then the same
problem will be observed when gdm starts.

FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
Silver 4410Y".
Yan Zhao Sept. 3, 2024, 12:25 a.m. UTC | #23
On Mon, Sep 02, 2024 at 11:49:43AM +0200, Vitaly Kuznetsov wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Fri, Aug 30, 2024 at 03:47:11PM +0200, Vitaly Kuznetsov wrote:
> >> Gerd Hoffmann <kraxel@redhat.com> writes:
> >> 
> >> >> Necroposting!
> >> >> 
> >> >> Turns out that this change broke "bochs-display" driver in QEMU even
> >> >> when the guest is modern (don't ask me 'who the hell uses bochs for
> >> >> modern guests', it was basically a configuration error :-). E.g:
> >> >
> >> > qemu stdvga (the default display device) is affected too.
> >> >
> >> 
> >> So far, I was only able to verify that the issue has nothing to do with
> >> OVMF and multi-vcpu, it reproduces very well with
> >> 
> >> $ qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s
> >> -cpu host -smp 1 -m 16384 -drive file=/var/lib/libvirt/images/c10s-bios.qcow2,if=none,id=drive-ide0-0-0
> >> -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
> >> -vnc :0 -device VGA -monitor stdio --no-reboot
> >> 
> >> Comparing traces of working and broken cases, I couldn't find anything
> >> suspicious but I may had missed something of course. For now, it seems
> >> like a userspace misbehavior resulting in a segfault.
> > Could you please share steps launch the broken guest desktop?
> > (better also with guest kernel version, name of desktop processes,
> >  name of X server)
> 
> I think the easiest would be to download the latest Centos Stream 10
> iso, e.g:
> 
> https://composes.stream.centos.org/stream-10/development/CentOS-Stream-10-20240902.d.0/compose/BaseOS/x86_64/iso/CentOS-Stream-10-20240902.d.0-x86_64-dvd1.iso
> (the link is probably not eternal but should work for a couple weeks,
> check https://composes.stream.centos.org/stream-10/development/ it it
> doesn't work anymore).
> 
> Then, just run it:
> $ /usr/libexec/qemu-kvm -machine q35,accel=kvm,kernel-irqchip=split -name guest=c10s -cpu host -smp 1 -m 16384 -cdrom CentOS-Stream-10-20240902.d.0-x86_64-dvd1.iso -vnc :0 -device VGA -monitor stdio --no-reboot
> 
> and connect to VNC console. To speed things up, pick 'Install Centos
> Stream 10' in the boot menu to avoid ISO integrity check.
> 
> With "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"
> commit included, you will see the following on the VNC console:
> installer tries starting Wayland, crashes and drops back into text
> console. If you revert the commit and start over, Wayland will normally
> start and you will see the installer.
Hmm, looks this issue can't be reproduced on physical machine "Coffee Lake-S".
The installer can show up to ask for language selection.

But it can be reproduced on the machine "Sapphire Rapids XCC".

> If the installer environment is inconvenient for debugging, then you can
> install in text mode (or with the commit reverted :-) and then the same
> problem will be observed when gdm starts.
>
Same to the gdm.

> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> Silver 4410Y".
Will have a debug to check what's going wrong. Thanks!
Sean Christopherson Sept. 3, 2024, 3:30 p.m. UTC | #24
On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> Silver 4410Y".

Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
on another hardware issue?
Vitaly Kuznetsov Sept. 3, 2024, 4:20 p.m. UTC | #25
Sean Christopherson <seanjc@google.com> writes:

> On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
>> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
>> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
>> Silver 4410Y".
>
> Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> on another hardware issue?

Very possible, as according to Yan Zhao this doesn't reproduce on at
least "Coffee Lake-S". Let me try to grab some random hardware around
and I'll be back with my observations.
Yan Zhao Sept. 4, 2024, 2:28 a.m. UTC | #26
On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
> > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> >> Silver 4410Y".
> >
> > Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> > on another hardware issue?
> 
> Very possible, as according to Yan Zhao this doesn't reproduce on at
> least "Coffee Lake-S". Let me try to grab some random hardware around
> and I'll be back with my observations.

Update some new findings from my side:

BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
from 0xfd000000 to 0xfe000000.

On "Sapphire Rapids XCC":

1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
   correctly. 
   i.e.
   if (gfn >= 0xfd000 && gfn < 0xfe000) {
   	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
   }
   return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;

2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
   restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
   correctly in this case).

3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
   this fb_map range as WC, with
   iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));

   However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
   this fb_map has been reserved as uc- by ioremap().
   Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.

   So, with KVM setting WB (no IPAT) to this fb_map range, the effective
   memory type is UC- and installer/gdm restarts endlessly.

4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
   to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
   (didn't verify the installer's case as I can't update the driver in that case).

   The reason is that the ioremap_wc() called during starting GDM will no longer
   meet conflict and can map guest PAT as WC.


WIP to find out why effective UC in fb_map range will make gdm to restart
endlessly.
Vitaly Kuznetsov Sept. 4, 2024, 11:47 a.m. UTC | #27
Vitaly Kuznetsov <vkuznets@redhat.com> writes:

> Sean Christopherson <seanjc@google.com> writes:
>
>> On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
>>> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
>>> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
>>> Silver 4410Y".
>>
>> Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
>> on another hardware issue?
>
> Very possible, as according to Yan Zhao this doesn't reproduce on at
> least "Coffee Lake-S". Let me try to grab some random hardware around
> and I'll be back with my observations.

We have some interesting results :-)

In addition to Sapphire Rapids, the issue also reproduces on a Cascade
lake CPU (Intel(R) Xeon(R) Silver 4214 CPU) but does NOT reproduce on
Skylake (Intel(R) Xeon(R) Gold 5118 CPU). I don't have a lot of desktop
CPUs around, so can't say much.

AMD also doesn't seem to be affected, at leats AMD EPYC 7413 works fine.
Yan Zhao Sept. 4, 2024, 12:17 p.m. UTC | #28
On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
> On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> > Sean Christopherson <seanjc@google.com> writes:
> > 
> > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> > >> Silver 4410Y".
> > >
> > > Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> > > on another hardware issue?
> > 
> > Very possible, as according to Yan Zhao this doesn't reproduce on at
> > least "Coffee Lake-S". Let me try to grab some random hardware around
> > and I'll be back with my observations.
> 
> Update some new findings from my side:
> 
> BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
> from 0xfd000000 to 0xfe000000.
> 
> On "Sapphire Rapids XCC":
> 
> 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
>    correctly. 
>    i.e.
>    if (gfn >= 0xfd000 && gfn < 0xfe000) {
>    	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>    }
>    return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> 
> 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
>    restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
>    correctly in this case).
> 
> 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
>    this fb_map range as WC, with
>    iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
> 
>    However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
>    this fb_map has been reserved as uc- by ioremap().
>    Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
> 
>    So, with KVM setting WB (no IPAT) to this fb_map range, the effective
>    memory type is UC- and installer/gdm restarts endlessly.
> 
> 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
>    to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
>    (didn't verify the installer's case as I can't update the driver in that case).
> 
>    The reason is that the ioremap_wc() called during starting GDM will no longer
>    meet conflict and can map guest PAT as WC.
> 
> 
> WIP to find out why effective UC in fb_map range will make gdm to restart
> endlessly.
Not sure whether it's simply because UC is too slow.

T=Test execution time of a selftest in which guest writes to a GPA for
  0x1000000UL times

              | Sapphire Rapids XCC  | Coffee Lake-S
--------------|----------------------|-----------------
KVM UC+IPAT   |    T=0m4.530s        |  T=0m0.622s
--------------|----------------------|-----------------
KVM WC+IPAT   |    T=0m0.149s        |  T=0m0.176s
--------------|----------------------|-----------------
KVM WB+IPAT   |    T=0m0.148s        |  T=0m0.148s
------------------------------------------------------
Sean Christopherson Sept. 5, 2024, 12:41 a.m. UTC | #29
On Wed, Sep 04, 2024, Yan Zhao wrote:
> On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
> > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> > > Sean Christopherson <seanjc@google.com> writes:
> > > 
> > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> > > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> > > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> > > >> Silver 4410Y".
> > > >
> > > > Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> > > > on another hardware issue?
> > > 
> > > Very possible, as according to Yan Zhao this doesn't reproduce on at
> > > least "Coffee Lake-S". Let me try to grab some random hardware around
> > > and I'll be back with my observations.
> > 
> > Update some new findings from my side:
> > 
> > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
> > from 0xfd000000 to 0xfe000000.
> > 
> > On "Sapphire Rapids XCC":
> > 
> > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
> >    correctly. 
> >    i.e.
> >    if (gfn >= 0xfd000 && gfn < 0xfe000) {
> >    	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> >    }
> >    return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> > 
> > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
> >    restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
> >    correctly in this case).
> > 
> > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
> >    this fb_map range as WC, with
> >    iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
> > 
> >    However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
> >    this fb_map has been reserved as uc- by ioremap().
> >    Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
> > 
> >    So, with KVM setting WB (no IPAT) to this fb_map range, the effective
> >    memory type is UC- and installer/gdm restarts endlessly.
> > 
> > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
> >    to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
> >    (didn't verify the installer's case as I can't update the driver in that case).
> > 
> >    The reason is that the ioremap_wc() called during starting GDM will no longer
> >    meet conflict and can map guest PAT as WC.

Huh.  The upside of this is that it sounds like there's nothing broken with WC
or self-snoop.

> > WIP to find out why effective UC in fb_map range will make gdm to restart
> > endlessly.
> Not sure whether it's simply because UC is too slow.
> 
> T=Test execution time of a selftest in which guest writes to a GPA for
>   0x1000000UL times
> 
>               | Sapphire Rapids XCC  | Coffee Lake-S
> --------------|----------------------|-----------------
> KVM UC+IPAT   |    T=0m4.530s        |  T=0m0.622s

Woah.  Have you tried testing MOVDIR64 and/or WT?  E.g. to see if the problem is
with UC specifically, or if it occurs with any accesses that immediately write
through to main memory.

> --------------|----------------------|-----------------
> KVM WC+IPAT   |    T=0m0.149s        |  T=0m0.176s
> --------------|----------------------|-----------------
> KVM WB+IPAT   |    T=0m0.148s        |  T=0m0.148s
> ------------------------------------------------------
Yan Zhao Sept. 5, 2024, 9:43 a.m. UTC | #30
On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote:
> On Wed, Sep 04, 2024, Yan Zhao wrote:
> > On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
> > > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> > > > Sean Christopherson <seanjc@google.com> writes:
> > > > 
> > > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> > > > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> > > > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> > > > >> Silver 4410Y".
> > > > >
> > > > > Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> > > > > on another hardware issue?
> > > > 
> > > > Very possible, as according to Yan Zhao this doesn't reproduce on at
> > > > least "Coffee Lake-S". Let me try to grab some random hardware around
> > > > and I'll be back with my observations.
> > > 
> > > Update some new findings from my side:
> > > 
> > > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
> > > from 0xfd000000 to 0xfe000000.
> > > 
> > > On "Sapphire Rapids XCC":
> > > 
> > > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
> > >    correctly. 
> > >    i.e.
> > >    if (gfn >= 0xfd000 && gfn < 0xfe000) {
> > >    	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> > >    }
> > >    return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> > > 
> > > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
> > >    restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
> > >    correctly in this case).
> > > 
> > > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
> > >    this fb_map range as WC, with
> > >    iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
> > > 
> > >    However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
> > >    this fb_map has been reserved as uc- by ioremap().
> > >    Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
> > > 
> > >    So, with KVM setting WB (no IPAT) to this fb_map range, the effective
> > >    memory type is UC- and installer/gdm restarts endlessly.
> > > 
> > > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
> > >    to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
> > >    (didn't verify the installer's case as I can't update the driver in that case).
> > > 
> > >    The reason is that the ioremap_wc() called during starting GDM will no longer
> > >    meet conflict and can map guest PAT as WC.
> 
> Huh.  The upside of this is that it sounds like there's nothing broken with WC
> or self-snoop.
Considering a different perspective, the fb_map range is used as frame buffer
(vram), with the guest writing to this range and the host reading from it.
If the issue were related to self-snooping, we would expect the VNC window to
display distorted data. However, the observed behavior is that the GDM window
shows up correctly for a sec and restarts over and over.

So, do you think we can simply fix this issue by calling ioremap_wc() for the
frame buffer/vram range in bochs driver, as is commonly done in other gpu
drivers?

--- a/drivers/gpu/drm/tiny/bochs.c
+++ b/drivers/gpu/drm/tiny/bochs.c
@@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev)
        if (pci_request_region(pdev, 0, "bochs-drm") != 0)
                DRM_WARN("Cannot request framebuffer, boot fb still active?\n");

-       bochs->fb_map = ioremap(addr, size);
+       bochs->fb_map = ioremap_wc(addr, size);
        if (bochs->fb_map == NULL) {
                DRM_ERROR("Cannot map framebuffer\n");
                return -ENOMEM;


> 
> > > WIP to find out why effective UC in fb_map range will make gdm to restart
> > > endlessly.
> > Not sure whether it's simply because UC is too slow.
> > 
> > T=Test execution time of a selftest in which guest writes to a GPA for
> >   0x1000000UL times
> > 
> >               | Sapphire Rapids XCC  | Coffee Lake-S
> > --------------|----------------------|-----------------
> > KVM UC+IPAT   |    T=0m4.530s        |  T=0m0.622s
> 
> Woah.  Have you tried testing MOVDIR64 and/or WT?  E.g. to see if the problem is
> with UC specifically, or if it occurs with any accesses that immediately write
> through to main memory.
> 
> > --------------|----------------------|-----------------
> > KVM WC+IPAT   |    T=0m0.149s        |  T=0m0.176s
> > --------------|----------------------|-----------------
> > KVM WB+IPAT   |    T=0m0.148s        |  T=0m0.148s
> > ------------------------------------------------------

I re-run all the tests and collected an averaged data (10 times each) as
below (previous data was just a single-run score):


T=Test execution time of a selftest in which guest writes to a GPA for
  0x1000000UL times with WRITE_ONCE

KVM memtype  | Sapphire Rapids XCC | Coffee Lake-S
-------------|---------------------|----------------
 WB+IPAT     |     T=0.1511s       |    T=0.1661s
-------------|---------------------|----------------
 WC+IPAT     |     T=0.1411s       |    T=0.1656s
-------------|---------------------|----------------
 WT+IPAT     |     T=3.7527s       |    T=0.6156s
-------------|---------------------|----------------
 WP+IPAT     |     T=4.4663s       |    T=0.6203s
-------------|---------------------|----------------
 UC+IPAT     |     T=3.4632s       |    T=0.5868s


T=Test execution time of a selftest in which guest writes to a GPA for
  0x1000000UL times with movdir64b.

(Coffee Lake-S has no feature movdir64).

KVM memtype  | Sapphire Rapids XCC | Coffee Lake-S
-------------|---------------------|----------------
 WB+IPAT     |     T=2.6142s       |       /     
-------------|---------------------|----------------
 WC+IPAT     |     T=2.8919s       |       /     
-------------|---------------------|----------------
 WT+IPAT     |     T=3.0966s       |       /      
-------------|---------------------|----------------
 WP+IPAT     |     T=2.4933s       |       /     
-------------|---------------------|----------------
 UC+IPAT     |     T=3.4606s       |       /
Yan Zhao Sept. 9, 2024, 5:30 a.m. UTC | #31
On Thu, Sep 05, 2024 at 05:43:17PM +0800, Yan Zhao wrote:
> On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote:
> > On Wed, Sep 04, 2024, Yan Zhao wrote:
> > > On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
> > > > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> > > > > Sean Christopherson <seanjc@google.com> writes:
> > > > > 
> > > > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> > > > > >> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> > > > > >> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> > > > > >> Silver 4410Y".
> > > > > >
> > > > > > Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> > > > > > on another hardware issue?
> > > > > 
> > > > > Very possible, as according to Yan Zhao this doesn't reproduce on at
> > > > > least "Coffee Lake-S". Let me try to grab some random hardware around
> > > > > and I'll be back with my observations.
> > > > 
> > > > Update some new findings from my side:
> > > > 
> > > > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
> > > > from 0xfd000000 to 0xfe000000.
> > > > 
> > > > On "Sapphire Rapids XCC":
> > > > 
> > > > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
> > > >    correctly. 
> > > >    i.e.
> > > >    if (gfn >= 0xfd000 && gfn < 0xfe000) {
> > > >    	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> > > >    }
> > > >    return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> > > > 
> > > > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
> > > >    restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
> > > >    correctly in this case).
> > > > 
> > > > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
> > > >    this fb_map range as WC, with
> > > >    iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
> > > > 
> > > >    However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
> > > >    this fb_map has been reserved as uc- by ioremap().
> > > >    Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
> > > > 
> > > >    So, with KVM setting WB (no IPAT) to this fb_map range, the effective
> > > >    memory type is UC- and installer/gdm restarts endlessly.
> > > > 
> > > > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
> > > >    to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
> > > >    (didn't verify the installer's case as I can't update the driver in that case).
> > > > 
> > > >    The reason is that the ioremap_wc() called during starting GDM will no longer
> > > >    meet conflict and can map guest PAT as WC.
> > 
> > Huh.  The upside of this is that it sounds like there's nothing broken with WC
> > or self-snoop.
> Considering a different perspective, the fb_map range is used as frame buffer
> (vram), with the guest writing to this range and the host reading from it.
> If the issue were related to self-snooping, we would expect the VNC window to
> display distorted data. However, the observed behavior is that the GDM window
> shows up correctly for a sec and restarts over and over.
> 
> So, do you think we can simply fix this issue by calling ioremap_wc() for the
> frame buffer/vram range in bochs driver, as is commonly done in other gpu
> drivers?
> 
> --- a/drivers/gpu/drm/tiny/bochs.c
> +++ b/drivers/gpu/drm/tiny/bochs.c
> @@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev)
>         if (pci_request_region(pdev, 0, "bochs-drm") != 0)
>                 DRM_WARN("Cannot request framebuffer, boot fb still active?\n");
> 
> -       bochs->fb_map = ioremap(addr, size);
> +       bochs->fb_map = ioremap_wc(addr, size);
>         if (bochs->fb_map == NULL) {
>                 DRM_ERROR("Cannot map framebuffer\n");
>                 return -ENOMEM;
> 
> 
> > 
> > > > WIP to find out why effective UC in fb_map range will make gdm to restart
> > > > endlessly.
> > > Not sure whether it's simply because UC is too slow.
> > > 
> > > T=Test execution time of a selftest in which guest writes to a GPA for
> > >   0x1000000UL times
> > > 
> > >               | Sapphire Rapids XCC  | Coffee Lake-S
> > > --------------|----------------------|-----------------
> > > KVM UC+IPAT   |    T=0m4.530s        |  T=0m0.622s
> > 
> > Woah.  Have you tried testing MOVDIR64 and/or WT?  E.g. to see if the problem is
> > with UC specifically, or if it occurs with any accesses that immediately write
> > through to main memory.
> > 
> > > --------------|----------------------|-----------------
> > > KVM WC+IPAT   |    T=0m0.149s        |  T=0m0.176s
> > > --------------|----------------------|-----------------
> > > KVM WB+IPAT   |    T=0m0.148s        |  T=0m0.148s
> > > ------------------------------------------------------
> 
> I re-run all the tests and collected an averaged data (10 times each) as
> below (previous data was just a single-run score):
> 
> 
> T=Test execution time of a selftest in which guest writes to a GPA for
>   0x1000000UL times with WRITE_ONCE
> 
> KVM memtype  | Sapphire Rapids XCC | Coffee Lake-S
> -------------|---------------------|----------------
>  WB+IPAT     |     T=0.1511s       |    T=0.1661s
> -------------|---------------------|----------------
>  WC+IPAT     |     T=0.1411s       |    T=0.1656s
> -------------|---------------------|----------------
>  WT+IPAT     |     T=3.7527s       |    T=0.6156s
> -------------|---------------------|----------------
>  WP+IPAT     |     T=4.4663s       |    T=0.6203s
> -------------|---------------------|----------------
>  UC+IPAT     |     T=3.4632s       |    T=0.5868s
> 
> 
> T=Test execution time of a selftest in which guest writes to a GPA for
>   0x1000000UL times with movdir64b.
> 
> (Coffee Lake-S has no feature movdir64).
> 
> KVM memtype  | Sapphire Rapids XCC | Coffee Lake-S
> -------------|---------------------|----------------
>  WB+IPAT     |     T=2.6142s       |       /     
> -------------|---------------------|----------------
>  WC+IPAT     |     T=2.8919s       |       /     
> -------------|---------------------|----------------
>  WT+IPAT     |     T=3.0966s       |       /      
> -------------|---------------------|----------------
>  WP+IPAT     |     T=2.4933s       |       /     
> -------------|---------------------|----------------
>  UC+IPAT     |     T=3.4606s       |       /     
>
Up to now, I think I have root caused this issue.

Status before this update:
In either ubuntu or centos, on "Sapphire Rapids XCC"
- gdm fails to launch gnome-shell when wayland is enabled, when
  effective memory type is UC/UC-.
- gdm is able launch gnome-shell correctly when wayland is enabled, when
  effective memory type is WB or WC.
- gdm is able launch gnome-shell correctly when wayland is not enabled, with
  any effective memory type.


Update:
1. I tried KVM memtype = WT + IPAT for this framebuffer range,
   gdm fails to launch gnome-shell when wayland is enabled.

   Since the only difference between WT and WB is that write in WT is slow,
   the failure should not be self-snoop issue.


2. The current bochs driver calls ioremap() to map framebuffer range.
   On x86 architectures, ioremap() maps VA with PAT=UC- and invokes
   memtype_reserve() to reserve the memory type as UC- for the physical range.
   This reservation can cause subsequent calls to ioremap_wc() to fail to map
   the VA with PAT=WC to the same framebuffer range in
   ttm_kmap_iter_linear_io_init().
   Consequently, the operation drm_gem_vram_bo_driver_move() become
   significantly slow on platforms where UC memory access is slow.

   When host KVM honors guest PAT memory types, the effective memory type        
   for this framebuffer range is                                                    
   - WC when ioremap_wc() is used in driver probing phase                           
   - UC- when ioremap() is used.

   I measured the data below for drm_gem_vram_bo_driver_move() which
   does memset to this framebuffer range with size 0x3e8000.

     ---------------------------------------------------------------
    |                               |      in bochs_hw_init()       |
    |                               |    ioremap()   | ioremap_wc() |
    |-------------------------------|----------------|--------------|
    |     cycles of                 |    2227.4M     |   17.8M      |
    | drm_gem_vram_bo_driver_move() |                |              |
    |-------------------------------|----------------|--------------|
    |     time of                   |    1.24s       |   0.01s      |
    | drm_gem_vram_bo_driver_move() |                |              |
     ---------------------------------------------------------------

    drm_gem_vram_bo_driver_move
       ttm_bo_move_memcpy()
           ttm_kmap_iter_linear_io_init()
	       iosys_map_set_vaddr_iomem(&iter_io->dmap,
	                                 ioremap_wc(mem->bus.offset,mem->size));
	   ttm_move_memcpy
	       memset_io or
	       drm_memcpy_from_wc


   If I comment out the memset_io() and drm_memcpy_from_wc() in
   ttm_move_memcpy(), drm_gem_vram_bo_driver_move() can be very fast and gdm is
   able to launch gnome-shell and login successfully, though sometime the
   screen is a little blurred. 

3. I sent a fix at [1] to let guest bochs driver map the framebuffer
   with PAT=WC for kernel access.
   [1] https://lore.kernel.org/all/20240909051529.26776-1-yan.y.zhao@intel.com/
Paolo Bonzini Sept. 9, 2024, 1:24 p.m. UTC | #32
On 9/9/24 07:30, Yan Zhao wrote:
> On Thu, Sep 05, 2024 at 05:43:17PM +0800, Yan Zhao wrote:
>> On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote:
>>> On Wed, Sep 04, 2024, Yan Zhao wrote:
>>>> On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
>>>>> On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
>>>>>> Sean Christopherson <seanjc@google.com> writes:
>>>>>>
>>>>>>> On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
>>>>>>>> FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
>>>>>>>> but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
>>>>>>>> Silver 4410Y".
>>>>>>>
>>>>>>> Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
>>>>>>> on another hardware issue?
>>>>>>
>>>>>> Very possible, as according to Yan Zhao this doesn't reproduce on at
>>>>>> least "Coffee Lake-S". Let me try to grab some random hardware around
>>>>>> and I'll be back with my observations.
>>>>>
>>>>> Update some new findings from my side:
>>>>>
>>>>> BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
>>>>> from 0xfd000000 to 0xfe000000.
>>>>>
>>>>> On "Sapphire Rapids XCC":
>>>>>
>>>>> 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
>>>>>     correctly.
>>>>>     i.e.
>>>>>     if (gfn >= 0xfd000 && gfn < 0xfe000) {
>>>>>     	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>>>>>     }
>>>>>     return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
>>>>>
>>>>> 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
>>>>>     restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
>>>>>     correctly in this case).
>>>>>
>>>>> 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
>>>>>     this fb_map range as WC, with
>>>>>     iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
>>>>>
>>>>>     However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
>>>>>     this fb_map has been reserved as uc- by ioremap().
>>>>>     Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
>>>>>
>>>>>     So, with KVM setting WB (no IPAT) to this fb_map range, the effective
>>>>>     memory type is UC- and installer/gdm restarts endlessly.
>>>>>
>>>>> 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
>>>>>     to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
>>>>>     (didn't verify the installer's case as I can't update the driver in that case).
>>>>>
>>>>>     The reason is that the ioremap_wc() called during starting GDM will no longer
>>>>>     meet conflict and can map guest PAT as WC.
>>>
>>> Huh.  The upside of this is that it sounds like there's nothing broken with WC
>>> or self-snoop.
>> Considering a different perspective, the fb_map range is used as frame buffer
>> (vram), with the guest writing to this range and the host reading from it.
>> If the issue were related to self-snooping, we would expect the VNC window to
>> display distorted data. However, the observed behavior is that the GDM window
>> shows up correctly for a sec and restarts over and over.
>>
>> So, do you think we can simply fix this issue by calling ioremap_wc() for the
>> frame buffer/vram range in bochs driver, as is commonly done in other gpu
>> drivers?
>>
>> --- a/drivers/gpu/drm/tiny/bochs.c
>> +++ b/drivers/gpu/drm/tiny/bochs.c
>> @@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev)
>>          if (pci_request_region(pdev, 0, "bochs-drm") != 0)
>>                  DRM_WARN("Cannot request framebuffer, boot fb still active?\n");
>>
>> -       bochs->fb_map = ioremap(addr, size);
>> +       bochs->fb_map = ioremap_wc(addr, size);
>>          if (bochs->fb_map == NULL) {
>>                  DRM_ERROR("Cannot map framebuffer\n");
>>                  return -ENOMEM;

While this is a fix for future kernels, it doesn't change the result for 
VMs already in existence.

I don't think there's an alternative to putting this behind a quirk.

Paolo
Sean Christopherson Sept. 9, 2024, 4:04 p.m. UTC | #33
On Mon, Sep 09, 2024, Paolo Bonzini wrote:
> On 9/9/24 07:30, Yan Zhao wrote:
> > On Thu, Sep 05, 2024 at 05:43:17PM +0800, Yan Zhao wrote:
> > > On Wed, Sep 04, 2024 at 05:41:06PM -0700, Sean Christopherson wrote:
> > > > On Wed, Sep 04, 2024, Yan Zhao wrote:
> > > > > On Wed, Sep 04, 2024 at 10:28:02AM +0800, Yan Zhao wrote:
> > > > > > On Tue, Sep 03, 2024 at 06:20:27PM +0200, Vitaly Kuznetsov wrote:
> > > > > > > Sean Christopherson <seanjc@google.com> writes:
> > > > > > > 
> > > > > > > > On Mon, Sep 02, 2024, Vitaly Kuznetsov wrote:
> > > > > > > > > FWIW, I use QEMU-9.0 from the same C10S (qemu-kvm-9.0.0-7.el10.x86_64)
> > > > > > > > > but I don't think it matters in this case. My CPU is "Intel(R) Xeon(R)
> > > > > > > > > Silver 4410Y".
> > > > > > > > 
> > > > > > > > Has this been reproduced on any other hardware besides SPR?  I.e. did we stumble
> > > > > > > > on another hardware issue?
> > > > > > > 
> > > > > > > Very possible, as according to Yan Zhao this doesn't reproduce on at
> > > > > > > least "Coffee Lake-S". Let me try to grab some random hardware around
> > > > > > > and I'll be back with my observations.
> > > > > > 
> > > > > > Update some new findings from my side:
> > > > > > 
> > > > > > BAR 0 of bochs VGA (fb_map) is used for frame buffer, covering phys range
> > > > > > from 0xfd000000 to 0xfe000000.
> > > > > > 
> > > > > > On "Sapphire Rapids XCC":
> > > > > > 
> > > > > > 1. If KVM forces this fb_map range to be WC+IPAT, installer/gdm can launch
> > > > > >     correctly.
> > > > > >     i.e.
> > > > > >     if (gfn >= 0xfd000 && gfn < 0xfe000) {
> > > > > >     	return (MTRR_TYPE_WRCOMB << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> > > > > >     }
> > > > > >     return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> > > > > > 
> > > > > > 2. If KVM forces this fb_map range to be UC+IPAT, installer failes to show / gdm
> > > > > >     restarts endlessly. (though on Coffee Lake-S, installer/gdm can launch
> > > > > >     correctly in this case).
> > > > > > 
> > > > > > 3. On starting GDM, ttm_kmap_iter_linear_io_init() in guest is called to set
> > > > > >     this fb_map range as WC, with
> > > > > >     iosys_map_set_vaddr_iomem(&iter_io->dmap, ioremap_wc(mem->bus.offset, mem->size));
> > > > > > 
> > > > > >     However, during bochs_pci_probe()-->bochs_load()-->bochs_hw_init(), pfns for
> > > > > >     this fb_map has been reserved as uc- by ioremap().
> > > > > >     Then, the ioremap_wc() during starting GDM will only map guest PAT with UC-.
> > > > > > 
> > > > > >     So, with KVM setting WB (no IPAT) to this fb_map range, the effective
> > > > > >     memory type is UC- and installer/gdm restarts endlessly.
> > > > > > 
> > > > > > 4. If KVM sets WB (no IPAT) to this fb_map range, and changes guest bochs driver
> > > > > >     to call ioremap_wc() instead in bochs_hw_init(), gdm can launch correctly.
> > > > > >     (didn't verify the installer's case as I can't update the driver in that case).
> > > > > > 
> > > > > >     The reason is that the ioremap_wc() called during starting GDM will no longer
> > > > > >     meet conflict and can map guest PAT as WC.
> > > > 
> > > > Huh.  The upside of this is that it sounds like there's nothing broken with WC
> > > > or self-snoop.
> > > Considering a different perspective, the fb_map range is used as frame buffer
> > > (vram), with the guest writing to this range and the host reading from it.
> > > If the issue were related to self-snooping, we would expect the VNC window to
> > > display distorted data. However, the observed behavior is that the GDM window
> > > shows up correctly for a sec and restarts over and over.
> > > 
> > > So, do you think we can simply fix this issue by calling ioremap_wc() for the
> > > frame buffer/vram range in bochs driver, as is commonly done in other gpu
> > > drivers?
> > > 
> > > --- a/drivers/gpu/drm/tiny/bochs.c
> > > +++ b/drivers/gpu/drm/tiny/bochs.c
> > > @@ -261,7 +261,9 @@ static int bochs_hw_init(struct drm_device *dev)
> > >          if (pci_request_region(pdev, 0, "bochs-drm") != 0)
> > >                  DRM_WARN("Cannot request framebuffer, boot fb still active?\n");
> > > 
> > > -       bochs->fb_map = ioremap(addr, size);
> > > +       bochs->fb_map = ioremap_wc(addr, size);
> > >          if (bochs->fb_map == NULL) {
> > >                  DRM_ERROR("Cannot map framebuffer\n");
> > >                  return -ENOMEM;
> 
> While this is a fix for future kernels, it doesn't change the result for VMs
> already in existence.

I would prefer to bottom out on exactly whether or not the SPR/CLX behavior is
working as intended.  Maybe the ~8x slowdown is just a side effect of any Intel
multi-socket/node system, but I think we should get confirmation (inasmuch as
possible) that that is indeed the case.  E.g. if this is actually a bug in CLX+,
then the actions we need to take are different.

> I don't think there's an alternative to putting this behind a quirk.

This gets a bit weird, which is why I want to bottom out on whether or not CLX
and SPR are working as intended.  If non-coherent DMA is attached to the VM, then
even before this patch KVM would honor guest PAT.  I agree that we don't want to
break existing setups, but if CLX+SPR are working as intended, then this is
inarguably a bochs driver bug, and I would prefer to have the quirk explicitly
reference bochs-compatible devices, e.g. in the name and documentation, so that
userspace can disable the quirk by default and only leave it enabled if a bochs
device is being exposed to the guest.
Yan Zhao Sept. 10, 2024, 1:05 a.m. UTC | #34
On Mon, Sep 09, 2024 at 03:24:40PM +0200, Paolo Bonzini wrote:
> While this is a fix for future kernels, it doesn't change the result for VMs
> already in existence.
Though this is the truth, I have concerns that there may be other guest drivers
with improper PAT configurations that were previously masked by KVM's force-WB
setting. Now that we respect the guest's PAT settings, these misconfigurations
could lead to degraded performance, potentially perceived as errors, as was
observed in the previous VMX unit test and the current Bochs scenario.

> I don't think there's an alternative to putting this behind a quirk.
Thorsten Leemhuis Oct. 7, 2024, 1:28 p.m. UTC | #35
On 30.08.24 11:35, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@google.com> writes:
> 
>> Unconditionally honor guest PAT on CPUs that support self-snoop, as
>> Intel has confirmed that CPUs that support self-snoop always snoop caches
>> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
>> even in the presence of aliased memtypes, thus there is no need to trust
>> the guest behaves and only honor PAT as a last resort, as KVM does today.
>>
>> Honoring guest PAT is desirable for use cases where the guest has access
>> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
>> (mediated, for all intents and purposes) GPU is exposed to the guest, along
>> with buffers that are consumed directly by the physical GPU, i.e. which
>> can't be proxied by the host to ensure writes from the guest are performed
>> with the correct memory type for the GPU.
> 
> Necroposting!
> 
> Turns out that this change broke "bochs-display" driver in QEMU even
> when the guest is modern (don't ask me 'who the hell uses bochs for
> modern guests', it was basically a configuration error :-). E.g:
> [...]

This regression made it to the list of tracked regressions. It seems
this thread stalled a while ago. Was this ever fixed? Does not look like
it, but I might have missed something. Or is this a regression I should
just ignore for one reason or another?


Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.

#regzbot poke
Vitaly Kuznetsov Oct. 7, 2024, 1:38 p.m. UTC | #36
"Linux regression tracking (Thorsten Leemhuis)"
<regressions@leemhuis.info> writes:

> On 30.08.24 11:35, Vitaly Kuznetsov wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>> 
>>> Unconditionally honor guest PAT on CPUs that support self-snoop, as
>>> Intel has confirmed that CPUs that support self-snoop always snoop caches
>>> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
>>> even in the presence of aliased memtypes, thus there is no need to trust
>>> the guest behaves and only honor PAT as a last resort, as KVM does today.
>>>
>>> Honoring guest PAT is desirable for use cases where the guest has access
>>> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
>>> (mediated, for all intents and purposes) GPU is exposed to the guest, along
>>> with buffers that are consumed directly by the physical GPU, i.e. which
>>> can't be proxied by the host to ensure writes from the guest are performed
>>> with the correct memory type for the GPU.
>> 
>> Necroposting!
>> 
>> Turns out that this change broke "bochs-display" driver in QEMU even
>> when the guest is modern (don't ask me 'who the hell uses bochs for
>> modern guests', it was basically a configuration error :-). E.g:
>> [...]
>
> This regression made it to the list of tracked regressions. It seems
> this thread stalled a while ago. Was this ever fixed? Does not look like
> it, but I might have missed something. Or is this a regression I should
> just ignore for one reason or another?
>

The regression was addressed in by reverting 377b2f359d1f in 6.11

commit 9d70f3fec14421e793ffbc0ec2f739b24e534900
Author: Paolo Bonzini <pbonzini@redhat.com>
Date:   Sun Sep 15 02:49:33 2024 -0400

    Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"

Also, there's a (pending) DRM patch fixing it from the guest's side:
https://gitlab.freedesktop.org/drm/misc/kernel/-/commit/9388ccf69925223223c87355a417ba39b13a5e8e
Thorsten Leemhuis Oct. 7, 2024, 2:04 p.m. UTC | #37
On 07.10.24 15:38, Vitaly Kuznetsov wrote:
> "Linux regression tracking (Thorsten Leemhuis)"
> <regressions@leemhuis.info> writes:
> 
>> On 30.08.24 11:35, Vitaly Kuznetsov wrote:
>>> Sean Christopherson <seanjc@google.com> writes:
>>>
>>>> Unconditionally honor guest PAT on CPUs that support self-snoop, as
>>>> Intel has confirmed that CPUs that support self-snoop always snoop caches
>>>> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
>>>> even in the presence of aliased memtypes, thus there is no need to trust
>>>> the guest behaves and only honor PAT as a last resort, as KVM does today.
>>>>
>>>> Honoring guest PAT is desirable for use cases where the guest has access
>>>> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
>>>> (mediated, for all intents and purposes) GPU is exposed to the guest, along
>>>> with buffers that are consumed directly by the physical GPU, i.e. which
>>>> can't be proxied by the host to ensure writes from the guest are performed
>>>> with the correct memory type for the GPU.
>>>
>>> Necroposting!
>>>
>>> Turns out that this change broke "bochs-display" driver in QEMU even
>>> when the guest is modern (don't ask me 'who the hell uses bochs for
>>> modern guests', it was basically a configuration error :-). E.g:
>>> [...]
>>
>> This regression made it to the list of tracked regressions. It seems
>> this thread stalled a while ago. Was this ever fixed? Does not look like
>> it, but I might have missed something. Or is this a regression I should
>> just ignore for one reason or another?
>>
> 
> The regression was addressed in by reverting 377b2f359d1f in 6.11
> 
> commit 9d70f3fec14421e793ffbc0ec2f739b24e534900
> Author: Paolo Bonzini <pbonzini@redhat.com>
> Date:   Sun Sep 15 02:49:33 2024 -0400
> 
>     Revert "KVM: VMX: Always honor guest PAT on CPUs that support self-snoop"

Thx. Sorry, missed that, thx for pointing me towards it. I had looked
for things like that, but seems I messed up my lore query. Apologies for
the noise!

> Also, there's a (pending) DRM patch fixing it from the guest's side:
> https://gitlab.freedesktop.org/drm/misc/kernel/-/commit/9388ccf69925223223c87355a417ba39b13a5e8e

Great!

Ciao, Thorsten

P.S.:

#regzbot fix: 9d70f3fec14421e793ffbc0ec2f739b24e534900
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 403cd8f914cd..7fa514830628 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4622,14 +4622,16 @@  static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 bool kvm_mmu_may_ignore_guest_pat(void)
 {
 	/*
-	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
+	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
+	 * not support self-snoop (or is affected by an erratum), and the VM
 	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
 	 * honor the memtype from the guest's PAT so that guest accesses to
 	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
 	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
-	 * KVM _always_ ignores guest PAT (when EPT is enabled).
+	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
+	 * bits in response to non-coherent device (un)registration.
 	 */
-	return shadow_memtype_mask;
+	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
 }
 
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 17a8e4fdf9c4..5dc4c24ae203 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7605,11 +7605,13 @@  static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 
 	/*
 	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
-	 * device attached.  Letting the guest control memory types on Intel
-	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
-	 * the guest to behave only as a last resort.
+	 * device attached and the CPU doesn't support self-snoop.  Letting the
+	 * guest control memory types on Intel CPUs without self-snoop may
+	 * result in unexpected behavior, and so KVM's (historical) ABI is to
+	 * trust the guest to behave only as a last resort.
 	 */
-	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
+	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
 	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);