diff mbox series

[5/5] KVM: VMX: Always honor guest PAT on CPUs that support self-snoop

Message ID 20240309010929.1403984-6-seanjc@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: VMX: Drop MTRR virtualization, honor guest PAT | expand

Commit Message

Sean Christopherson March 9, 2024, 1:09 a.m. UTC
Unconditionally honor guest PAT on CPUs that support self-snoop, as
Intel has confirmed that CPUs that support self-snoop always snoop caches
and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
even in the presence of aliased memtypes, thus there is no need to trust
the guest behaves and only honor PAT as a last resort, as KVM does today.

Honoring guest PAT is desirable for use cases where the guest has access
to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
(mediated, for all intents and purposes) GPU is exposed to the guest, along
with buffers that are consumed directly by the physical GPU, i.e. which
can't be proxied by the host to ensure writes from the guest are performed
with the correct memory type for the GPU.

Cc: Yiwei Zhang <zzyiwei@google.com>
Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c |  8 +++++---
 arch/x86/kvm/vmx/vmx.c | 10 ++++++----
 2 files changed, 11 insertions(+), 7 deletions(-)

Comments

Yan Zhao March 11, 2024, 1:16 a.m. UTC | #1
On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
> Unconditionally honor guest PAT on CPUs that support self-snoop, as
> Intel has confirmed that CPUs that support self-snoop always snoop caches
> and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
> even in the presence of aliased memtypes, thus there is no need to trust
> the guest behaves and only honor PAT as a last resort, as KVM does today.
> 
> Honoring guest PAT is desirable for use cases where the guest has access
> to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
> (mediated, for all intents and purposes) GPU is exposed to the guest, along
> with buffers that are consumed directly by the physical GPU, i.e. which
> can't be proxied by the host to ensure writes from the guest are performed
> with the correct memory type for the GPU.
> 
> Cc: Yiwei Zhang <zzyiwei@google.com>
> Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
> Suggested-by: Kevin Tian <kevin.tian@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c |  8 +++++---
>  arch/x86/kvm/vmx/vmx.c | 10 ++++++----
>  2 files changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 403cd8f914cd..7fa514830628 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>  bool kvm_mmu_may_ignore_guest_pat(void)
>  {
>  	/*
> -	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
> +	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
> +	 * not support self-snoop (or is affected by an erratum), and the VM
>  	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
>  	 * honor the memtype from the guest's PAT so that guest accesses to
>  	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
>  	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
> -	 * KVM _always_ ignores guest PAT (when EPT is enabled).
> +	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
> +	 * bits in response to non-coherent device (un)registration.
>  	 */
> -	return shadow_memtype_mask;
> +	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
>  }
>  
>  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 17a8e4fdf9c4..5dc4c24ae203 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  
>  	/*
>  	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> -	 * device attached.  Letting the guest control memory types on Intel
> -	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> -	 * the guest to behave only as a last resort.
> +	 * device attached and the CPU doesn't support self-snoop.  Letting the
> +	 * guest control memory types on Intel CPUs without self-snoop may
> +	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> +	 * trust the guest to behave only as a last resort.
>  	 */
> -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> +	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> +	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
>  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn
about unsafe before honoring guest memory type.
Though it's a KVM's historical ABI, it's not safe in the security perspective
because page aliasing without proper cache flush handling on CPUs without
self-snoop may open a door for guest to read uninitialized host data.
e.g. when there's a noncoherent DMA device attached, and if there's a memory
region that is not pinned in vfio/iommufd side, (e.g. memory region in vfio's
skipped section), then though the guest memory from this memory region is not
accessible to noncoherent DMAs, vCPUs can still access this part of guest memory.
Then if vCPUs use WC as guest type, it may bypass host's initialization data in
cache and read stale data in host, causing information leak.

My preference is still to force WB
(i.e. (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT) in case of
!static_cpu_has(X86_FEATURE_SELFSNOOP) && kvm_arch_has_noncoherent_dma(vcpu->kvm).
Firstly, it's because there're few CPUs with features VMX without self-snoop;
Secondly, security takes priority over functionality :)

>  
>  	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);
> -- 
> 2.44.0.278.ge034bb2e1d-goog
>
Sean Christopherson March 12, 2024, 12:25 a.m. UTC | #2
On Mon, Mar 11, 2024, Yan Zhao wrote:
> On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 17a8e4fdf9c4..5dc4c24ae203 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> >  
> >  	/*
> >  	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> > -	 * device attached.  Letting the guest control memory types on Intel
> > -	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> > -	 * the guest to behave only as a last resort.
> > +	 * device attached and the CPU doesn't support self-snoop.  Letting the
> > +	 * guest control memory types on Intel CPUs without self-snoop may
> > +	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> > +	 * trust the guest to behave only as a last resort.
> >  	 */
> > -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > +	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > +	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> >  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
> For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn
> about unsafe before honoring guest memory type.

I don't think it gains us enough to offset the potential pain such a message
would bring.  Assuming the warning isn't outright ignored, the most likely scenario
is that the warning will cause random end users to worry that the setup they've
been running for years is broken, when in reality it's probably just fine for their
use case.

I would be quite surprised if there are people running untrusted workloads on
10+ year old silicon *and* have passthrough devices and non-coherent IOMMUs/DMA.
And anyone exposing a device directly to an untrusted workload really should have
done their homework.

And it's not like we're going to change KVM's historical behavior at this point.

> Though it's a KVM's historical ABI, it's not safe in the security perspective
> because page aliasing without proper cache flush handling on CPUs without
> self-snoop may open a door for guest to read uninitialized host data.
> e.g. when there's a noncoherent DMA device attached, and if there's a memory
> region that is not pinned in vfio/iommufd side, (e.g. memory region in vfio's
> skipped section), then though the guest memory from this memory region is not
> accessible to noncoherent DMAs, vCPUs can still access this part of guest memory.
> Then if vCPUs use WC as guest type, it may bypass host's initialization data in
> cache and read stale data in host, causing information leak.
> 
> My preference is still to force WB
> (i.e. (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT) in case of
> !static_cpu_has(X86_FEATURE_SELFSNOOP) && kvm_arch_has_noncoherent_dma(vcpu->kvm).
> Firstly, it's because there're few CPUs with features VMX without self-snoop;

This is unfortunately not true.  I don't know the details, but apparently all
Intel CPUs before Ivy Bridge had a one or more related errata.

/*
 * Processors which have self-snooping capability can handle conflicting
 * memory type across CPUs by snooping its own cache. However, there exists
 * CPU models in which having conflicting memory types still leads to
 * unpredictable behavior, machine check errors, or hangs. Clear this
 * feature to prevent its use on machines with known erratas.
 */
static void check_memory_type_self_snoop_errata(struct cpuinfo_x86 *c)
{
	switch (c->x86_model) {
	case INTEL_FAM6_CORE_YONAH:
	case INTEL_FAM6_CORE2_MEROM:
	case INTEL_FAM6_CORE2_MEROM_L:
	case INTEL_FAM6_CORE2_PENRYN:
	case INTEL_FAM6_CORE2_DUNNINGTON:
	case INTEL_FAM6_NEHALEM:
	case INTEL_FAM6_NEHALEM_G:
	case INTEL_FAM6_NEHALEM_EP:
	case INTEL_FAM6_NEHALEM_EX:
	case INTEL_FAM6_WESTMERE:
	case INTEL_FAM6_WESTMERE_EP:
	case INTEL_FAM6_SANDYBRIDGE:
		setup_clear_cpu_cap(X86_FEATURE_SELFSNOOP);
	}
}

> Secondly, security takes priority over functionality :)

Yeah, but not breaking userspace for setups that have existed for 10+ years takes
priority over all of that :-)
Tian, Kevin March 12, 2024, 7:30 a.m. UTC | #3
> From: Sean Christopherson <seanjc@google.com>
> Sent: Tuesday, March 12, 2024 8:26 AM
> 
> On Mon, Mar 11, 2024, Yan Zhao wrote:
> > On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 17a8e4fdf9c4..5dc4c24ae203 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu
> *vcpu, gfn_t gfn, bool is_mmio)
> > >
> > >  	/*
> > >  	 * Force WB and ignore guest PAT if the VM does NOT have a non-
> coherent
> > > -	 * device attached.  Letting the guest control memory types on Intel
> > > -	 * CPUs may result in unexpected behavior, and so KVM's ABI is to
> trust
> > > -	 * the guest to behave only as a last resort.
> > > +	 * device attached and the CPU doesn't support self-snoop.  Letting
> the
> > > +	 * guest control memory types on Intel CPUs without self-snoop may
> > > +	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> > > +	 * trust the guest to behave only as a last resort.
> > >  	 */
> > > -	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > > +	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > > +	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > >  		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT)
> | VMX_EPT_IPAT_BIT;
> >
> > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should
> warn
> > about unsafe before honoring guest memory type.
> 
> I don't think it gains us enough to offset the potential pain such a message
> would bring.  Assuming the warning isn't outright ignored, the most likely
> scenario
> is that the warning will cause random end users to worry that the setup
> they've
> been running for years is broken, when in reality it's probably just fine for
> their
> use case.

Isn't the 'worry' necessary to allow end users evaluate whether "it's
probably just fine for their use case"?

I saw the old comment already mentioned that doing so may lead to
unexpected behaviors. But I'm not sure whether such code-level
caveat has been visible enough to end users.

> 
> I would be quite surprised if there are people running untrusted workloads
> on
> 10+ year old silicon *and* have passthrough devices and non-coherent
> IOMMUs/DMA.

this is probably true.

> And anyone exposing a device directly to an untrusted workload really
> should have
> done their homework.

or they run trusted workloads which might be tampered by virus to
exceed the scope of their homework. 
Sean Christopherson March 12, 2024, 4:07 p.m. UTC | #4
On Tue, Mar 12, 2024, Kevin Tian wrote:
> > From: Sean Christopherson <seanjc@google.com>
> > Sent: Tuesday, March 12, 2024 8:26 AM
> > 
> > On Mon, Mar 11, 2024, Yan Zhao wrote:
> > > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn
> > > about unsafe before honoring guest memory type.
> > 
> > I don't think it gains us enough to offset the potential pain such a
> > message would bring.  Assuming the warning isn't outright ignored, the most
> > likely scenario is that the warning will cause random end users to worry
> > that the setup they've been running for years is broken, when in reality
> > it's probably just fine for their
> > use case.
> 
> Isn't the 'worry' necessary to allow end users evaluate whether "it's
> probably just fine for their use case"?

Realistically, outside of large scale deployments, no end user is going to be able
to make that evaluation, because practically speaking it requires someone with
quite low-level hardware knowledge to be able to make that judgment call.  And
counting by number of human end users (as opposed to number of VMs being run), I
am willing to bet that the overwhelming majority of KVM users aren't kernel or
systems engineers.

Understandably, users tend to be alarmed by (or suspicious of) new warnings that
show up.  E.g. see the ancient KVM_SET_TSS_ADDR pr_warn[*].  And recently, we had
an internal bug report filed against KVM because they observed a performance
regression when booting a KVM guest, and saw a new message about some CPU
vulnerability being mitigated on VM-Exit that showed up in their *guest* kernel.

In short, my concern is that adding a new pr_warn() will generate noise for end
users *and* for KVM developers/maintainers, because even if we phrase the message
to talk specifically about "untrusted workloads", the majority of affected users
will not have the necessary knowledge to make an informed decision.

[*] https://lore.kernel.org/all/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956@tarent.de

> I saw the old comment already mentioned that doing so may lead to unexpected
> behaviors. But I'm not sure whether such code-level caveat has been visible
> enough to end users.

Another point to consider: KVM is _always_ potentially broken on such CPUs, as
KVM forces WB for guest accesses.  I.e. KVM will create memory aliasing if the
host has guest memory mapped as non-WB in the PAT, without non-coherent DMA
exposed to the guest.

> > I would be quite surprised if there are people running untrusted workloads
> > on 10+ year old silicon *and* have passthrough devices and non-coherent
> > IOMMUs/DMA.
> 
> this is probably true.
> 
> > And anyone exposing a device directly to an untrusted workload really
> > should have done their homework.
> 
> or they run trusted workloads which might be tampered by virus to
> exceed the scope of their homework. 
Yan Zhao March 13, 2024, 1:18 a.m. UTC | #5
On Tue, Mar 12, 2024 at 09:07:11AM -0700, Sean Christopherson wrote:
> On Tue, Mar 12, 2024, Kevin Tian wrote:
> > > From: Sean Christopherson <seanjc@google.com>
> > > Sent: Tuesday, March 12, 2024 8:26 AM
> > > 
> > > On Mon, Mar 11, 2024, Yan Zhao wrote:
> > > > For the case of !static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> > > > kvm_arch_has_noncoherent_dma(vcpu->kvm), I think we at least should warn
> > > > about unsafe before honoring guest memory type.
> > > 
> > > I don't think it gains us enough to offset the potential pain such a
> > > message would bring.  Assuming the warning isn't outright ignored, the most
> > > likely scenario is that the warning will cause random end users to worry
> > > that the setup they've been running for years is broken, when in reality
> > > it's probably just fine for their
> > > use case.
> > 
> > Isn't the 'worry' necessary to allow end users evaluate whether "it's
> > probably just fine for their use case"?
> 
> Realistically, outside of large scale deployments, no end user is going to be able
> to make that evaluation, because practically speaking it requires someone with
> quite low-level hardware knowledge to be able to make that judgment call.  And
> counting by number of human end users (as opposed to number of VMs being run), I
> am willing to bet that the overwhelming majority of KVM users aren't kernel or
> systems engineers.
> 
> Understandably, users tend to be alarmed by (or suspicious of) new warnings that
> show up.  E.g. see the ancient KVM_SET_TSS_ADDR pr_warn[*].  And recently, we had
> an internal bug report filed against KVM because they observed a performance
> regression when booting a KVM guest, and saw a new message about some CPU
> vulnerability being mitigated on VM-Exit that showed up in their *guest* kernel.
> 
> In short, my concern is that adding a new pr_warn() will generate noise for end
> users *and* for KVM developers/maintainers, because even if we phrase the message
> to talk specifically about "untrusted workloads", the majority of affected users
> will not have the necessary knowledge to make an informed decision.
> 
> [*] https://lore.kernel.org/all/f1afa6c0-cde2-ab8b-ea71-bfa62a45b956@tarent.de
> 
> > I saw the old comment already mentioned that doing so may lead to unexpected
> > behaviors. But I'm not sure whether such code-level caveat has been visible
> > enough to end users.
>
What about add a new module parameter to turn on honoring guest for
non-coherent DMAs on CPUs without self-snoop?
A previous example is VFIO's "allow_unsafe_interrupts" parameter.

> Another point to consider: KVM is _always_ potentially broken on such CPUs, as
> KVM forces WB for guest accesses.  I.e. KVM will create memory aliasing if the
> host has guest memory mapped as non-WB in the PAT, without non-coherent DMA
> exposed to the guest.
In this case, memory aliasing may only lead to guest not function well, since
guest is not using WC/UC (which can bypass host initialization data in cache).
But if guest has any chance to read information not intended to it, I believe
we need to fix it as well.


> > > I would be quite surprised if there are people running untrusted workloads
> > > on 10+ year old silicon *and* have passthrough devices and non-coherent
> > > IOMMUs/DMA.
What if the guest is a totally malicious one?
Previously we trust the guest in the case of noncoherent DMA is because
we believe a malicious guest will only meet data corruption and shoot his own
foot.

But as Jason raised the security problem in another mail thread [1],
this will expose security hole if CPUs have no self-snoop. So, we need
to fix it, right?
+ Jason, in case I didn't understand this problem correctly.

[1] https://lore.kernel.org/all/20240108153818.GK50406@nvidia.com/

> > this is probably true.
> > 
> > > And anyone exposing a device directly to an untrusted workload really
> > > should have done their homework.
> > 
> > or they run trusted workloads which might be tampered by virus to
> > exceed the scope of their homework. 
Tian, Kevin March 13, 2024, 8:52 a.m. UTC | #6
> From: Zhao, Yan Y <yan.y.zhao@intel.com>
> Sent: Wednesday, March 13, 2024 9:19 AM
> 
> On Tue, Mar 12, 2024 at 09:07:11AM -0700, Sean Christopherson wrote:
> > On Tue, Mar 12, 2024, Kevin Tian wrote:
> > > I saw the old comment already mentioned that doing so may lead to
> unexpected
> > > behaviors. But I'm not sure whether such code-level caveat has been
> visible
> > > enough to end users.
> >
> What about add a new module parameter to turn on honoring guest for
> non-coherent DMAs on CPUs without self-snoop?
> A previous example is VFIO's "allow_unsafe_interrupts" parameter.

Not sure whether such parameter has a real value.

If it's default 'off' then you break those 10yr+ setups and it's unacceptable.

If it's default 'on' then same effect as this patch does then I'm not sure who'd
want to turn it off afterwards. Somebody aware of such limitation can simply
avoid assigning device w/ non-coherent DMA in VM config file instead of 
further going to toggle the module parameter to prevent something which
he already knows not to do.

> 
> > Another point to consider: KVM is _always_ potentially broken on such
> CPUs, as
> > KVM forces WB for guest accesses.  I.e. KVM will create memory aliasing if
> the
> > host has guest memory mapped as non-WB in the PAT, without non-
> coherent DMA
> > exposed to the guest.
> In this case, memory aliasing may only lead to guest not function well, since
> guest is not using WC/UC (which can bypass host initialization data in cache).
> But if guest has any chance to read information not intended to it, I believe
> we need to fix it as well.

Having cache/memory inconsistent could hurt both guest and host.

So in concept forcing WB instead of following host attribute on such CPUs
is kind of broken, though in reality we may not see an usage of exposing
non-WB memory to guest on those old setups as discussed for virtio-gpu case.

> 
> 
> > > > I would be quite surprised if there are people running untrusted
> workloads
> > > > on 10+ year old silicon *and* have passthrough devices and non-
> coherent
> > > > IOMMUs/DMA.
> What if the guest is a totally malicious one?
> Previously we trust the guest in the case of noncoherent DMA is because
> we believe a malicious guest will only meet data corruption and shoot his
> own
> foot.
> 
> But as Jason raised the security problem in another mail thread [1],
> this will expose security hole if CPUs have no self-snoop. So, we need
> to fix it, right?
> + Jason, in case I didn't understand this problem correctly.
> 
> [1] https://lore.kernel.org/all/20240108153818.GK50406@nvidia.com/

We'll certain fix the security hole on CPUs w/ self-snoop. In this case
CPU accesses are guaranteed to be coherent and the vulnerability can
only be exposed via non-coherent DMA which is supposed to be fixed
by your coming series. 

But for old CPUs w/o self-snoop the hole can be exploited using either CPU
or non-coherent DMA once the guest PAT is honored. As long as nobody
is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
by Sean that sustaining the old behavior is probably the best option...
Yan Zhao March 13, 2024, 8:55 a.m. UTC | #7
> We'll certain fix the security hole on CPUs w/ self-snoop. In this case
> CPU accesses are guaranteed to be coherent and the vulnerability can
> only be exposed via non-coherent DMA which is supposed to be fixed
> by your coming series. 
> 
> But for old CPUs w/o self-snoop the hole can be exploited using either CPU
> or non-coherent DMA once the guest PAT is honored. As long as nobody
> is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
We can cook a patch to check CPU self-snoop and force WB in EPT even for
non-coherent DMA if no self-snoop. Then back porting such a patch together
with the IOMMU side mitigation for non-coherent DMA.

Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of
no self-snoop.

> by Sean that sustaining the old behavior is probably the best option...
Yes, as long as we think exposing secuirty hole on those platforms is acceptable.
Sean Christopherson March 13, 2024, 3:09 p.m. UTC | #8
On Wed, Mar 13, 2024, Yan Zhao wrote:
> > We'll certain fix the security hole on CPUs w/ self-snoop. In this case
> > CPU accesses are guaranteed to be coherent and the vulnerability can
> > only be exposed via non-coherent DMA which is supposed to be fixed
> > by your coming series. 
> > 
> > But for old CPUs w/o self-snoop the hole can be exploited using either CPU
> > or non-coherent DMA once the guest PAT is honored. As long as nobody
> > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
> We can cook a patch to check CPU self-snoop and force WB in EPT even for
> non-coherent DMA if no self-snoop. Then back porting such a patch together
> with the IOMMU side mitigation for non-coherent DMA.

Please don't.  This is a "let sleeping dogs lie" situation.

  let sleeping dogs lie - avoid interfering in a situation that is currently
  causing no problems but might do so as a result of such interference.

Yes, there is technically a flaw, but we have zero evidence that anyone cares or
that it is actually problematic in practice.  On the other hand, any functional
change we make has a non-zero changes of breaking existing setups that have worked
for many years. 

> Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of
> no self-snoop.
> 
> > by Sean that sustaining the old behavior is probably the best option...
> Yes, as long as we think exposing secuirty hole on those platforms is acceptable. 

Yes, I think it's acceptable.  Obviously not ideal, but given the alternatives,
I think it is a reasonable risk.

Being 100% secure is simply not possible.  Security is often about balancing the
risk/threat against the cost.  In this case, the risk is low (old hardware,
uncommon setup for untrusted guests, small window of opportunity, and limited
data exposure), whereas the cost is high (decent chance of breaking existing VMs).
Yan Zhao March 14, 2024, 12:12 a.m. UTC | #9
On Wed, Mar 13, 2024 at 08:09:28AM -0700, Sean Christopherson wrote:
> On Wed, Mar 13, 2024, Yan Zhao wrote:
> > > We'll certain fix the security hole on CPUs w/ self-snoop. In this case
> > > CPU accesses are guaranteed to be coherent and the vulnerability can
> > > only be exposed via non-coherent DMA which is supposed to be fixed
> > > by your coming series. 
> > > 
> > > But for old CPUs w/o self-snoop the hole can be exploited using either CPU
> > > or non-coherent DMA once the guest PAT is honored. As long as nobody
> > > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
> > We can cook a patch to check CPU self-snoop and force WB in EPT even for
> > non-coherent DMA if no self-snoop. Then back porting such a patch together
> > with the IOMMU side mitigation for non-coherent DMA.
> 
> Please don't.  This is a "let sleeping dogs lie" situation.
> 
>   let sleeping dogs lie - avoid interfering in a situation that is currently
>   causing no problems but might do so as a result of such interference.
> 
> Yes, there is technically a flaw, but we have zero evidence that anyone cares or
> that it is actually problematic in practice.  On the other hand, any functional
> change we make has a non-zero changes of breaking existing setups that have worked
> for many years. 
> 
> > Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of
> > no self-snoop.
> > 
> > > by Sean that sustaining the old behavior is probably the best option...
> > Yes, as long as we think exposing secuirty hole on those platforms is acceptable. 
> 
> Yes, I think it's acceptable.  Obviously not ideal, but given the alternatives,
> I think it is a reasonable risk.
> 
> Being 100% secure is simply not possible.  Security is often about balancing the
> risk/threat against the cost.  In this case, the risk is low (old hardware,
> uncommon setup for untrusted guests, small window of opportunity, and limited
> data exposure), whereas the cost is high (decent chance of breaking existing VMs).
Ok, thanks for explanation!
I still have one last question: if in future there are CPUs with no selfsnoop
(for some unknown reason, or just paranoid), do we allow this unsafe honoring of
guest memory type for non-coherent DMAs?
Sean Christopherson March 14, 2024, 1 a.m. UTC | #10
On Thu, Mar 14, 2024, Yan Zhao wrote:
> On Wed, Mar 13, 2024 at 08:09:28AM -0700, Sean Christopherson wrote:
> > On Wed, Mar 13, 2024, Yan Zhao wrote:
> > > > We'll certain fix the security hole on CPUs w/ self-snoop. In this case
> > > > CPU accesses are guaranteed to be coherent and the vulnerability can
> > > > only be exposed via non-coherent DMA which is supposed to be fixed
> > > > by your coming series. 
> > > > 
> > > > But for old CPUs w/o self-snoop the hole can be exploited using either CPU
> > > > or non-coherent DMA once the guest PAT is honored. As long as nobody
> > > > is willing to actually fix the CPU path (is it possible?) I'm kind of convinced
> > > We can cook a patch to check CPU self-snoop and force WB in EPT even for
> > > non-coherent DMA if no self-snoop. Then back porting such a patch together
> > > with the IOMMU side mitigation for non-coherent DMA.
> > 
> > Please don't.  This is a "let sleeping dogs lie" situation.
> > 
> >   let sleeping dogs lie - avoid interfering in a situation that is currently
> >   causing no problems but might do so as a result of such interference.
> > 
> > Yes, there is technically a flaw, but we have zero evidence that anyone cares or
> > that it is actually problematic in practice.  On the other hand, any functional
> > change we make has a non-zero changes of breaking existing setups that have worked
> > for many years. 
> > 
> > > Otherwise, IOMMU side mitigation alone is meaningless for platforms of CPU of
> > > no self-snoop.
> > > 
> > > > by Sean that sustaining the old behavior is probably the best option...
> > > Yes, as long as we think exposing secuirty hole on those platforms is acceptable. 
> > 
> > Yes, I think it's acceptable.  Obviously not ideal, but given the alternatives,
> > I think it is a reasonable risk.
> > 
> > Being 100% secure is simply not possible.  Security is often about balancing the
> > risk/threat against the cost.  In this case, the risk is low (old hardware,
> > uncommon setup for untrusted guests, small window of opportunity, and limited
> > data exposure), whereas the cost is high (decent chance of breaking existing VMs).
> Ok, thanks for explanation!
> I still have one last question: if in future there are CPUs with no selfsnoop
> (for some unknown reason, or just paranoid),

Don't jinx us :-)

> do we allow this unsafe honoring of guest memory type for non-coherent DMAs? 

Hmm, no?  AIUI, the intent is that all future CPUs will support self-snoop.
Assuming that's the case, then I think it's fair to put the onus on Intel to not
have escapes, i.e. to not end up shipping CPUs with erratas.  Then, if an erratum
does come along, we can make a decision, e.g. allow older CPUs by default, but
require an admin opt-in for new CPUs.

But let's just hope that's a problem we never have to deal with.  :-D
Chao Gao March 25, 2024, 3:43 a.m. UTC | #11
On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
>Unconditionally honor guest PAT on CPUs that support self-snoop, as
>Intel has confirmed that CPUs that support self-snoop always snoop caches
>and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
>even in the presence of aliased memtypes, thus there is no need to trust
>the guest behaves and only honor PAT as a last resort, as KVM does today.
>
>Honoring guest PAT is desirable for use cases where the guest has access
>to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
>(mediated, for all intents and purposes) GPU is exposed to the guest, along
>with buffers that are consumed directly by the physical GPU, i.e. which
>can't be proxied by the host to ensure writes from the guest are performed
>with the correct memory type for the GPU.
>
>Cc: Yiwei Zhang <zzyiwei@google.com>
>Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
>Suggested-by: Kevin Tian <kevin.tian@intel.com>
>Signed-off-by: Sean Christopherson <seanjc@google.com>
>---
> arch/x86/kvm/mmu/mmu.c |  8 +++++---
> arch/x86/kvm/vmx/vmx.c | 10 ++++++----
> 2 files changed, 11 insertions(+), 7 deletions(-)
>
>diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>index 403cd8f914cd..7fa514830628 100644
>--- a/arch/x86/kvm/mmu/mmu.c
>+++ b/arch/x86/kvm/mmu/mmu.c
>@@ -4622,14 +4622,16 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
> bool kvm_mmu_may_ignore_guest_pat(void)
> {
> 	/*
>-	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
>+	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
>+	 * not support self-snoop (or is affected by an erratum), and the VM
> 	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
> 	 * honor the memtype from the guest's PAT so that guest accesses to
> 	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
> 	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
>-	 * KVM _always_ ignores guest PAT (when EPT is enabled).
>+	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
>+	 * bits in response to non-coherent device (un)registration.
> 	 */
>-	return shadow_memtype_mask;
>+	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
> }
> 
> int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 17a8e4fdf9c4..5dc4c24ae203 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> 
> 	/*
> 	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
>-	 * device attached.  Letting the guest control memory types on Intel
>-	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
>-	 * the guest to behave only as a last resort.
>+	 * device attached and the CPU doesn't support self-snoop.  Letting the
>+	 * guest control memory types on Intel CPUs without self-snoop may
>+	 * result in unexpected behavior, and so KVM's (historical) ABI is to
>+	 * trust the guest to behave only as a last resort.
> 	 */
>-	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
>+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
>+	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

W/ this change, guests w/o pass-thru devices can also access UC memory. Locking
UC memory leads to bus lock. So, guests w/o pass-thru devices can potentially
launch DOS attacks on other CPUs on host. isn't it a problem?
Sean Christopherson April 1, 2024, 10:29 p.m. UTC | #12
On Mon, Mar 25, 2024, Chao Gao wrote:
> On Fri, Mar 08, 2024 at 05:09:29PM -0800, Sean Christopherson wrote:
> >Unconditionally honor guest PAT on CPUs that support self-snoop, as
> >Intel has confirmed that CPUs that support self-snoop always snoop caches
> >and store buffers.  I.e. CPUs with self-snoop maintain cache coherency
> >even in the presence of aliased memtypes, thus there is no need to trust
> >the guest behaves and only honor PAT as a last resort, as KVM does today.
> >
> >Honoring guest PAT is desirable for use cases where the guest has access
> >to non-coherent DMA _without_ bouncing through VFIO, e.g. when a virtual
> >(mediated, for all intents and purposes) GPU is exposed to the guest, along
> >with buffers that are consumed directly by the physical GPU, i.e. which
> >can't be proxied by the host to ensure writes from the guest are performed
> >with the correct memory type for the GPU.

...

> > int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >index 17a8e4fdf9c4..5dc4c24ae203 100644
> >--- a/arch/x86/kvm/vmx/vmx.c
> >+++ b/arch/x86/kvm/vmx/vmx.c
> >@@ -7605,11 +7605,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > 
> > 	/*
> > 	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
> >-	 * device attached.  Letting the guest control memory types on Intel
> >-	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
> >-	 * the guest to behave only as a last resort.
> >+	 * device attached and the CPU doesn't support self-snoop.  Letting the
> >+	 * guest control memory types on Intel CPUs without self-snoop may
> >+	 * result in unexpected behavior, and so KVM's (historical) ABI is to
> >+	 * trust the guest to behave only as a last resort.
> > 	 */
> >-	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
> >+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
> >+	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
> W/ this change, guests w/o pass-thru devices can also access UC memory. Locking
> UC memory leads to bus lock. So, guests w/o pass-thru devices can potentially
> launch DOS attacks on other CPUs on host. isn't it a problem?

Guests can already trigger bus locks with atomic accesses that split cache lines.
And SPR adds bus lock detection.  So practically speaking, I'm pretty sure ICX is
the only CPU where anything close to a novel attack is possible.  And FWIW, such
an attack is already possible on AMD.
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 403cd8f914cd..7fa514830628 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4622,14 +4622,16 @@  static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 bool kvm_mmu_may_ignore_guest_pat(void)
 {
 	/*
-	 * When EPT is enabled (shadow_memtype_mask is non-zero), and the VM
+	 * When EPT is enabled (shadow_memtype_mask is non-zero), the CPU does
+	 * not support self-snoop (or is affected by an erratum), and the VM
 	 * has non-coherent DMA (DMA doesn't snoop CPU caches), KVM's ABI is to
 	 * honor the memtype from the guest's PAT so that guest accesses to
 	 * memory that is DMA'd aren't cached against the guest's wishes.  As a
 	 * result, KVM _may_ ignore guest PAT, whereas without non-coherent DMA,
-	 * KVM _always_ ignores guest PAT (when EPT is enabled).
+	 * KVM _always_ ignores or honors guest PAT, i.e. doesn't toggle SPTE
+	 * bits in response to non-coherent device (un)registration.
 	 */
-	return shadow_memtype_mask;
+	return !static_cpu_has(X86_FEATURE_SELFSNOOP) && shadow_memtype_mask;
 }
 
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 17a8e4fdf9c4..5dc4c24ae203 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7605,11 +7605,13 @@  static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 
 	/*
 	 * Force WB and ignore guest PAT if the VM does NOT have a non-coherent
-	 * device attached.  Letting the guest control memory types on Intel
-	 * CPUs may result in unexpected behavior, and so KVM's ABI is to trust
-	 * the guest to behave only as a last resort.
+	 * device attached and the CPU doesn't support self-snoop.  Letting the
+	 * guest control memory types on Intel CPUs without self-snoop may
+	 * result in unexpected behavior, and so KVM's (historical) ABI is to
+	 * trust the guest to behave only as a last resort.
 	 */
-	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
+	if (!static_cpu_has(X86_FEATURE_SELFSNOOP) &&
+	    !kvm_arch_has_noncoherent_dma(vcpu->kvm))
 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
 	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT);