
[08/16] KVM: x86/mmu: Bug the VM if kvm_zap_gfn_range() is called for TDX

Message ID 20240515005952.3410568-9-rick.p.edgecombe@intel.com (mailing list archive)
State New, archived
Series TDX MMU prep series part 1

Commit Message

Edgecombe, Rick P May 15, 2024, 12:59 a.m. UTC
When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap
guest mappings so they can be faulted in with different PTE properties.

For TDX private memory this technique is fundamentally not possible.
Remapping private memory requires the guest to "accept" it, and also the
needed PTE properties are not currently supported by TDX for private
memory.

These CPU features are:
1) MTRR update
2) CR0.CD update
3) Non-coherent DMA status update
4) APICV update

Since they cannot be supported, they should be blocked from being
exercised by a TD. In the case of CR0.CD, the feature is fundamentally not
supported for TDX as KVM cannot see the guest registers. For APICv, it will
be inhibited in future changes.

Guest MTRR support is more of an interesting case. Supported versions of
the TDX module fix the MTRR CPUID bit to 1, but as previously discussed,
it is not possible to fully support the feature. This leaves KVM with a
few options:
 - Support a modified version of the architecture where the caching
   attributes are ignored for private memory.
 - Don't support MTRRs and treat the set MTRR CPUID bit as a TDX Module
   bug.

With the additional consideration that likely guest MTRR support in KVM
will be going away, the latter option is the best. Prevent MTRR MSR writes
from calling kvm_zap_gfn_range() in future changes.

Lastly, the most interesting case is non-coherent DMA status updates.
There isn't a way to reject the call. KVM is just notified that there is a
non-coherent DMA device attached, and expected to act accordingly. For
normal VMs today, that means to start respecting guest PAT. However,
recently there has been a proposal to avoid doing this on selfsnoop CPUs
(see link). On such CPUs it should not be problematic to simply always
configure the EPT to honor guest PAT. In future changes TDX can enforce
this behavior for shared memory, resulting in shared memory always
respecting guest PAT for TDX. So kvm_zap_gfn_range() will not need to be
called in this case either.

Unfortunately, this will result in different cache attributes between
private and shared memory, as private memory is always WB and cannot be
changed by the VMM on current TDX modules. But it can't really be helped
while also supporting non-coherent DMA devices.

Since all callers will be prevented from calling kvm_zap_gfn_range() in
future changes, report a bug and terminate the guest if other future
changes to KVM result in triggering kvm_zap_gfn_range() for a TD.

For lack of a better method currently, use kvm_gfn_shared_mask() to
determine if private memory cannot be zapped (as in TDX, the only VM type
that sets it).

Link: https://lore.kernel.org/all/20240309010929.1403984-6-seanjc@google.com/
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
---
TDX MMU Part 1:
 - Remove support from "KVM: x86/tdp_mmu: Zap leafs only for private memory"
 - Add this KVM_BUG_ON() instead
---
 arch/x86/kvm/mmu/mmu.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Huang, Kai May 15, 2024, 1:27 p.m. UTC | #1
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap
> guest mappings so they can be faulted in with different PTE properties.
> 
> For TDX private memory this technique is fundamentally not possible.
> Remapping private memory requires the guest to "accept" it, and also the
> needed PTE properties are not currently supported by TDX for private
> memory.
> 
> These CPU features are:
> 1) MTRR update
> 2) CR0.CD update
> 3) Non-coherent DMA status update
> 4) APICV update
> 
> Since they cannot be supported, they should be blocked from being
> exercised by a TD. In the case of CRO.CD, the feature is fundamentally not
> supported for TDX as it cannot see the guest registers. For APICV
> inhibit it in future changes.
> 
> Guest MTRR support is more of an interesting case. Supported versions of
> the TDX module fix the MTRR CPUID bit to 1, but as previously discussed,
> it is not possible to fully support the feature. This leaves KVM with a
> few options:
>  - Support a modified version of the architecture where the caching
>    attributes are ignored for private memory.
>  - Don't support MTRRs and treat the set MTRR CPUID bit as a TDX Module
>    bug.
> 
> With the additional consideration that likely guest MTRR support in KVM
> will be going away, the later option is the best. Prevent MTRR MSR writes
> from calling kvm_zap_gfn_range() in future changes.
> 
> Lastly, the most interesting case is non-coherent DMA status updates.
> There isn't a way to reject the call. KVM is just notified that there is a
> non-coherent DMA device attached, and expected to act accordingly. For
> normal VMs today, that means to start respecting guest PAT. However,
> recently there has been a proposal to avoid doing this on selfsnoop CPUs
> (see link). On such CPUs it should not be problematic to simply always
> configure the EPT to honor guest PAT. In future changes TDX can enforce
> this behavior for shared memory, resulting in shared memory always
> respecting guest PAT for TDX. So kvm_zap_gfn_range() will not need to be
> called in this case either.
> 
> Unfortunately, this will result in different cache attributes between
> private and shared memory, as private memory is always WB and cannot be
> changed by the VMM on current TDX modules. But it can't really be helped
> while also supporting non-coherent DMA devices.
> 
> Since all callers will be prevented from calling kvm_zap_gfn_range() in
> future changes, report a bug and terminate the guest if other future
> changes to KVM result in triggering kvm_zap_gfn_range() for a TD.
> 
> For lack of a better method currently, use kvm_gfn_shared_mask() to
> determine if private memory cannot be zapped (as in TDX, the only VM type
> that sets it).
> 
> Link: https://lore.kernel.org/all/20240309010929.1403984-6-seanjc@google.com/
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU Part 1:
>  - Remove support from "KVM: x86/tdp_mmu: Zap leafs only for private memory"
>  - Add this KVM_BUG_ON() instead
> ---
>  arch/x86/kvm/mmu/mmu.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d5cf5b15a10e..808805b3478d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>  
> -	if (tdp_mmu_enabled)
> +	if (tdp_mmu_enabled) {
> +		/*
> +		 * kvm_zap_gfn_range() is used when MTRR or PAT memory
> +		 * type was changed.  TDX can't handle zapping the private
> +		 * mapping, but it's ok because KVM doesn't support either of
> +		 * those features for TDX. In case a new caller appears, BUG
> +		 * the VM if it's called for solutions with private aliases.
> +		 */
> +		KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
>  		flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
> +	}
>  
>  	if (flush)
>  		kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);

kvm_zap_gfn_range() looks like a generic function.  I think it makes more
sense to let the callers explicitly check whether the VM is a TDX guest and
do the KVM_BUG_ON()?
Edgecombe, Rick P May 15, 2024, 3:22 p.m. UTC | #2
On Wed, 2024-05-15 at 13:27 +0000, Huang, Kai wrote:
> 
> kvm_zap_gfn_range() looks a generic function.  I think it makes more sense
> to let the callers to explicitly check whether VM is TDX guest and do the
> KVM_BUG_ON()?

Other TDX changes will prevent this function from getting called, so it's
basically what you are suggesting. This change is to catch any new cases that
pop up, which we can't do at the caller.
Sean Christopherson May 15, 2024, 3:34 p.m. UTC | #3
On Tue, May 14, 2024, Rick Edgecombe wrote:
> When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap
> guest mappings so they can be faulted in with different PTE properties.
> 
> For TDX private memory this technique is fundamentally not possible.
> Remapping private memory requires the guest to "accept" it, and also the
> needed PTE properties are not currently supported by TDX for private
> memory.
> 
> These CPU features are:
> 1) MTRR update
> 2) CR0.CD update
> 3) Non-coherent DMA status update

Please go review the series that removes these disasters[*], I suspect it would
literally have taken less time than writing this changelog :-)

[*] https://lore.kernel.org/all/20240309010929.1403984-1-seanjc@google.com

> 4) APICV update
> 
> Since they cannot be supported, they should be blocked from being
> exercised by a TD. In the case of CRO.CD, the feature is fundamentally not
> supported for TDX as it cannot see the guest registers. For APICV
> inhibit it in future changes.
> 
> Guest MTRR support is more of an interesting case. Supported versions of
> the TDX module fix the MTRR CPUID bit to 1, but as previously discussed,
> it is not possible to fully support the feature. This leaves KVM with a
> few options:
>  - Support a modified version of the architecture where the caching
>    attributes are ignored for private memory.
>  - Don't support MTRRs and treat the set MTRR CPUID bit as a TDX Module
>    bug.
> 
> With the additional consideration that likely guest MTRR support in KVM
> will be going away, the later option is the best. Prevent MTRR MSR writes
> from calling kvm_zap_gfn_range() in future changes.
> 
> Lastly, the most interesting case is non-coherent DMA status updates.
> There isn't a way to reject the call. KVM is just notified that there is a
> non-coherent DMA device attached, and expected to act accordingly. For
> normal VMs today, that means to start respecting guest PAT. However,
> recently there has been a proposal to avoid doing this on selfsnoop CPUs
> (see link). On such CPUs it should not be problematic to simply always
> configure the EPT to honor guest PAT. In future changes TDX can enforce
> this behavior for shared memory, resulting in shared memory always
> respecting guest PAT for TDX. So kvm_zap_gfn_range() will not need to be
> called in this case either.
> 
> Unfortunately, this will result in different cache attributes between
> private and shared memory, as private memory is always WB and cannot be
> changed by the VMM on current TDX modules. But it can't really be helped
> while also supporting non-coherent DMA devices.
> 
> Since all callers will be prevented from calling kvm_zap_gfn_range() in
> future changes, report a bug and terminate the guest if other future
> changes to KVM result in triggering kvm_zap_gfn_range() for a TD.
> 
> For lack of a better method currently, use kvm_gfn_shared_mask() to
> determine if private memory cannot be zapped (as in TDX, the only VM type
> that sets it).
> 
> Link: https://lore.kernel.org/all/20240309010929.1403984-6-seanjc@google.com/
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> ---
> TDX MMU Part 1:
>  - Remove support from "KVM: x86/tdp_mmu: Zap leafs only for private memory"
>  - Add this KVM_BUG_ON() instead
> ---
>  arch/x86/kvm/mmu/mmu.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d5cf5b15a10e..808805b3478d 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  
>  	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>  
> -	if (tdp_mmu_enabled)
> +	if (tdp_mmu_enabled) {
> +		/*
> +		 * kvm_zap_gfn_range() is used when MTRR or PAT memory
> +		 * type was changed.  TDX can't handle zapping the private
> +		 * mapping, but it's ok because KVM doesn't support either of
> +		 * those features for TDX. In case a new caller appears, BUG
> +		 * the VM if it's called for solutions with private aliases.
> +		 */
> +		KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);

Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX".  Using a
generic name quite obviously doesn't prevent TDX details for bleeding into common
code, and dancing around things just makes it all unnecessarily confusing.

If we can't avoid bleeding TDX details into common code, my vote is to bite the
bullet and simply check vm_type.

This KVM_BUG_ON() also should not be in the tdp_mmu_enabled path.  Yeah, yeah,
TDX is restricted to the TDP MMU, but there's no reason to bleed that detail all
over the place.

>  		flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
> +	}
>  
>  	if (flush)
>  		kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);
> -- 
> 2.34.1
>
Edgecombe, Rick P May 15, 2024, 3:49 p.m. UTC | #4
On Wed, 2024-05-15 at 08:34 -0700, Sean Christopherson wrote:
> On Tue, May 14, 2024, Rick Edgecombe wrote:
> > When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap
> > guest mappings so they can be faulted in with different PTE properties.
> > 
> > For TDX private memory this technique is fundamentally not possible.
> > Remapping private memory requires the guest to "accept" it, and also the
> > needed PTE properties are not currently supported by TDX for private
> > memory.
> > 
> > These CPU features are:
> > 1) MTRR update
> > 2) CR0.CD update
> > 3) Non-coherent DMA status update
> 
> Please go review the series that removes these disaster[*], I suspect it would
> literally have taken less time than writing this changelog :-)
> 
> [*] https://lore.kernel.org/all/20240309010929.1403984-1-seanjc@google.com

We have one additional detail for TDX in that KVM will have different cache
attributes between private and shared. Although implementation is in a later
patch, that detail has an effect on whether we need to support zapping in the
basic MMU support.

> 
> > 4) APICV update
> > 
> > Since they cannot be supported, they should be blocked from being
> > exercised by a TD. In the case of CRO.CD, the feature is fundamentally not
> > supported for TDX as it cannot see the guest registers. For APICV
> > inhibit it in future changes.
> > 
> > Guest MTRR support is more of an interesting case. Supported versions of
> > the TDX module fix the MTRR CPUID bit to 1, but as previously discussed,
> > it is not possible to fully support the feature. This leaves KVM with a
> > few options:
> >  - Support a modified version of the architecture where the caching
> >    attributes are ignored for private memory.
> >  - Don't support MTRRs and treat the set MTRR CPUID bit as a TDX Module
> >    bug.
> > 
> > With the additional consideration that likely guest MTRR support in KVM
> > will be going away, the later option is the best. Prevent MTRR MSR writes
> > from calling kvm_zap_gfn_range() in future changes.
> > 
> > Lastly, the most interesting case is non-coherent DMA status updates.
> > There isn't a way to reject the call. KVM is just notified that there is a
> > non-coherent DMA device attached, and expected to act accordingly. For
> > normal VMs today, that means to start respecting guest PAT. However,
> > recently there has been a proposal to avoid doing this on selfsnoop CPUs
> > (see link). On such CPUs it should not be problematic to simply always
> > configure the EPT to honor guest PAT. In future changes TDX can enforce
> > this behavior for shared memory, resulting in shared memory always
> > respecting guest PAT for TDX. So kvm_zap_gfn_range() will not need to be
> > called in this case either.
> > 
> > Unfortunately, this will result in different cache attributes between
> > private and shared memory, as private memory is always WB and cannot be
> > changed by the VMM on current TDX modules. But it can't really be helped
> > while also supporting non-coherent DMA devices.
> > 
> > Since all callers will be prevented from calling kvm_zap_gfn_range() in
> > future changes, report a bug and terminate the guest if other future
> > changes to KVM result in triggering kvm_zap_gfn_range() for a TD.
> > 
> > For lack of a better method currently, use kvm_gfn_shared_mask() to
> > determine if private memory cannot be zapped (as in TDX, the only VM type
> > that sets it).
> > 
> > Link:
> > https://lore.kernel.org/all/20240309010929.1403984-6-seanjc@google.com/
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > ---
> > TDX MMU Part 1:
> >  - Remove support from "KVM: x86/tdp_mmu: Zap leafs only for private memory"
> >  - Add this KVM_BUG_ON() instead
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 11 ++++++++++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index d5cf5b15a10e..808805b3478d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t
> > gfn_start, gfn_t gfn_end)
> >  
> >         flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> >  
> > -       if (tdp_mmu_enabled)
> > +       if (tdp_mmu_enabled) {
> > +               /*
> > +                * kvm_zap_gfn_range() is used when MTRR or PAT memory
> > +                * type was changed.  TDX can't handle zapping the private
> > +                * mapping, but it's ok because KVM doesn't support either
> > of
> > +                * those features for TDX. In case a new caller appears, BUG
> > +                * the VM if it's called for solutions with private aliases.
> > +                */
> > +               KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
> 
> Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX".  Using a
> generic name quite obviously doesn't prevent TDX details for bleeding into
> common
> code, and dancing around things just makes it all unnecessarily confusing.

Ok, yep on the general point.

> 
> If we can't avoid bleeding TDX details into common code, my vote is to bite
> the
> bullet and simply check vm_type.

In this case the generic property is the inability to re-fault in private memory
whenever we want. However, the reason we can get away with just not supporting
it is because TDX won't support the operations that use this function.
Otherwise, we could
zap only the shared half of the memory (depending on the intention of the
caller).

To me, a KVM_BUG_ON() seems like a less bad place to check the VM type
specifically, since it doesn't affect the functional part of the code. So all
together, I'd lean towards vm_type in this case.

> 
> This KVM_BUG_ON() also should not be in the tdp_mmu_enabled path.  Yeah, yeah,
> TDX is restricted to the TDP MMU, but there's no reason to bleed that detail
> all
> over the place.
> 
> 

Right.
Edgecombe, Rick P May 15, 2024, 3:56 p.m. UTC | #5
On Wed, 2024-05-15 at 08:49 -0700, Rick Edgecombe wrote:
> On Wed, 2024-05-15 at 08:34 -0700, Sean Christopherson wrote:
> > On Tue, May 14, 2024, Rick Edgecombe wrote:
> > > When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap
> > > guest mappings so they can be faulted in with different PTE properties.
> > > 
> > > For TDX private memory this technique is fundamentally not possible.
> > > Remapping private memory requires the guest to "accept" it, and also the
> > > needed PTE properties are not currently supported by TDX for private
> > > memory.
> > > 
> > > These CPU features are:
> > > 1) MTRR update
> > > 2) CR0.CD update
> > > 3) Non-coherent DMA status update
> > 
> > Please go review the series that removes these disaster[*], I suspect it
> > would
> > literally have taken less time than writing this changelog :-)
> > 
> > [*] https://lore.kernel.org/all/20240309010929.1403984-1-seanjc@google.com
> 
> We have one additional detail for TDX in that KVM will have different cache
> attributes between private and shared. Although implementation is in a later
> patch, that detail has an affect on whether we need to support zapping in the
> basic MMU support.

Or more specifically, we only need this zapping if we *try* to have consistent
cache attributes between private and shared. In the non-coherent DMA case we
can't have them be consistent because TDX doesn't support changing the private
memory in this way.
Sean Christopherson May 15, 2024, 4:02 p.m. UTC | #6
On Wed, May 15, 2024, Rick P Edgecombe wrote:
> On Wed, 2024-05-15 at 08:49 -0700, Rick Edgecombe wrote:
> > On Wed, 2024-05-15 at 08:34 -0700, Sean Christopherson wrote:
> > > On Tue, May 14, 2024, Rick Edgecombe wrote:
> > > > When virtualizing some CPU features, KVM uses kvm_zap_gfn_range() to zap
> > > > guest mappings so they can be faulted in with different PTE properties.
> > > > 
> > > > For TDX private memory this technique is fundamentally not possible.
> > > > Remapping private memory requires the guest to "accept" it, and also the
> > > > needed PTE properties are not currently supported by TDX for private
> > > > memory.
> > > > 
> > > > These CPU features are:
> > > > 1) MTRR update
> > > > 2) CR0.CD update
> > > > 3) Non-coherent DMA status update
> > > 
> > > Please go review the series that removes these disaster[*], I suspect it
> > > would
> > > literally have taken less time than writing this changelog :-)
> > > 
> > > [*] https://lore.kernel.org/all/20240309010929.1403984-1-seanjc@google.com
> > 
> > We have one additional detail for TDX in that KVM will have different cache
> > attributes between private and shared. Although implementation is in a later
> > patch, that detail has an affect on whether we need to support zapping in the
> > basic MMU support.
> 
> Or most specifically, we only need this zapping if we *try* to have consistent
> cache attributes between private and shared. In the non-coherent DMA case we
> can't have them be consistent because TDX doesn't support changing the private
> memory in this way.

Huh?  That makes no sense.  A physical page can't be simultaneously mapped SHARED
and PRIVATE, so there can't be meaningful cache attribute aliasing between private
and shared EPT entries.

Trying to provide consistency for the GPA is like worrying about having matching
PAT entires for the virtual address in two different processes.
Edgecombe, Rick P May 15, 2024, 4:12 p.m. UTC | #7
On Wed, 2024-05-15 at 09:02 -0700, Sean Christopherson wrote:
> > Or most specifically, we only need this zapping if we *try* to have
> > consistent
> > cache attributes between private and shared. In the non-coherent DMA case we
> > can't have them be consistent because TDX doesn't support changing the
> > private
> > memory in this way.
> 
> Huh?  That makes no sense.  A physical page can't be simultaneously mapped
> SHARED
> and PRIVATE, so there can't be meaningful cache attribute aliasing between
> private
> and shared EPT entries.
> 
> Trying to provide consistency for the GPA is like worrying about having
> matching
> PAT entires for the virtual address in two different processes.

No, not matching between the private and shared mappings of the same page. The
whole private memory will be WB, and the whole shared half will honor PAT.
Isaku Yamahata May 15, 2024, 4:22 p.m. UTC | #8
On Wed, May 15, 2024 at 08:34:37AM -0700,
Sean Christopherson <seanjc@google.com> wrote:

> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index d5cf5b15a10e..808805b3478d 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >  
> >  	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> >  
> > -	if (tdp_mmu_enabled)
> > +	if (tdp_mmu_enabled) {
> > +		/*
> > +		 * kvm_zap_gfn_range() is used when MTRR or PAT memory
> > +		 * type was changed.  TDX can't handle zapping the private
> > +		 * mapping, but it's ok because KVM doesn't support either of
> > +		 * those features for TDX. In case a new caller appears, BUG
> > +		 * the VM if it's called for solutions with private aliases.
> > +		 */
> > +		KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
> 
> Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX".  Using a
> generic name quite obviously doesn't prevent TDX details for bleeding into common
> code, and dancing around things just makes it all unnecessarily confusing.
> 
> If we can't avoid bleeding TDX details into common code, my vote is to bite the
> bullet and simply check vm_type.

TDX has several aspects related to the TDP MMU.
1) Based on the faulting GPA, determine which KVM page table to walk.
   (private-vs-shared)
2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct memory
   load/store.  TDP MMU needs hooks for it.
3) The tables must be zapped from the leaf, not the root or the middle.

For 1) and 2), what about something like this?  TDX backend code will set
kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask() only
for address conversion (shared<->private).

For 1), maybe we can add struct kvm_page_fault.walk_mirrored_pt
        (or whatever preferable name)?

For 3), flag of memslot handles it.

---
 arch/x86/include/asm/kvm_host.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index aabf1648a56a..218b575d24bd 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1289,6 +1289,7 @@ struct kvm_arch {
 	u8 vm_type;
 	bool has_private_mem;
 	bool has_protected_state;
+	bool has_mirrored_pt;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	struct list_head active_mmu_pages;
 	struct list_head zapped_obsolete_pages;
@@ -2171,8 +2172,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
 
 #ifdef CONFIG_KVM_PRIVATE_MEM
 #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
+#define kvm_arch_has_mirrored_pt(kvm) ((kvm)->arch.has_mirrored_pt)
 #else
 #define kvm_arch_has_private_mem(kvm) false
+#define kvm_arch_has_mirrored_pt(kvm) false
 #endif
 
 static inline u16 kvm_read_ldt(void)
Sean Christopherson May 15, 2024, 6:09 p.m. UTC | #9
On Wed, May 15, 2024, Rick P Edgecombe wrote:
> On Wed, 2024-05-15 at 09:02 -0700, Sean Christopherson wrote:
> > > Or most specifically, we only need this zapping if we *try* to have
> > > consistent cache attributes between private and shared. In the
> > > non-coherent DMA case we can't have them be consistent because TDX
> > > doesn't support changing the private memory in this way.
> > 
> > Huh?  That makes no sense.  A physical page can't be simultaneously mapped
> > SHARED and PRIVATE, so there can't be meaningful cache attribute aliasing
> > between private and shared EPT entries.
> > 
> > Trying to provide consistency for the GPA is like worrying about having
> > matching PAT entires for the virtual address in two different processes.
> 
> No, not matching between the private and shared mappings of the same page. The
> whole private memory will be WB, and the whole shared half will honor PAT.

I'm still failing to see why that's at all interesting.  The non-coherent DMA
trainwreck is all about KVM worrying about the guest and host having different
memtypes for the same physical page.

If the host is accessing TDX private memory, we have far, far bigger problems
than aliased memtypes.  And so the fact that TDX private memory is forced WB is
interesting only to the guest, not KVM.
Edgecombe, Rick P May 15, 2024, 6:22 p.m. UTC | #10
On Wed, 2024-05-15 at 11:09 -0700, Sean Christopherson wrote:
> On Wed, May 15, 2024, Rick P Edgecombe wrote:
> > On Wed, 2024-05-15 at 09:02 -0700, Sean Christopherson wrote:
> > > > Or most specifically, we only need this zapping if we *try* to have
> > > > consistent cache attributes between private and shared. In the
> > > > non-coherent DMA case we can't have them be consistent because TDX
> > > > doesn't support changing the private memory in this way.
> > > 
> > > Huh?  That makes no sense.  A physical page can't be simultaneously mapped
> > > SHARED and PRIVATE, so there can't be meaningful cache attribute aliasing
> > > between private and shared EPT entries.
> > > 
> > > Trying to provide consistency for the GPA is like worrying about having
> > > matching PAT entires for the virtual address in two different processes.
> > 
> > No, not matching between the private and shared mappings of the same page.
> > The
> > whole private memory will be WB, and the whole shared half will honor PAT.
> 
> I'm still failing to see why that's at all interesting.  The non-coherent DMA
> trainwreck is all about KVM worrying about the guest and host having different
> memtypes for the same physical page.

Ok. The split seemed a little odd and special. I'm not sure it's the most
interesting thing in the world, but there was some debate internally about it.

> 
> If the host is accessing TDX private memory, we have far, far bigger problems
> than aliased memtypes.

This wasn't the concern.

>   And so the fact that TDX private memory is forced WB is
> interesting only to the guest, not KVM.

It's just another little quirk in an already complicated solution. The third
thing we discussed was somehow rejecting or not supporting non-coherent DMA.
This seemed simpler than that.
Sean Christopherson May 15, 2024, 7:48 p.m. UTC | #11
On Wed, May 15, 2024, Rick P Edgecombe wrote:
> On Wed, 2024-05-15 at 11:09 -0700, Sean Christopherson wrote:
> > On Wed, May 15, 2024, Rick P Edgecombe wrote:
> > > On Wed, 2024-05-15 at 09:02 -0700, Sean Christopherson wrote:
> > > > > Or most specifically, we only need this zapping if we *try* to have
> > > > > consistent cache attributes between private and shared. In the
> > > > > non-coherent DMA case we can't have them be consistent because TDX
> > > > > doesn't support changing the private memory in this way.
> > > > 
> > > > Huh?  That makes no sense.  A physical page can't be simultaneously mapped
> > > > SHARED and PRIVATE, so there can't be meaningful cache attribute aliasing
> > > > between private and shared EPT entries.
> > > > 
> > > > Trying to provide consistency for the GPA is like worrying about having
> > > > matching PAT entires for the virtual address in two different processes.
> > > 
> > > No, not matching between the private and shared mappings of the same page.
> > > The
> > > whole private memory will be WB, and the whole shared half will honor PAT.
> > 
> > I'm still failing to see why that's at all interesting.  The non-coherent DMA
> > trainwreck is all about KVM worrying about the guest and host having different
> > memtypes for the same physical page.
> 
> Ok. The split seemed a little odd and special. I'm not sure it's the most
> interesting thing in the world, but there was some debate internally about it.
> 
> > 
> > If the host is accessing TDX private memory, we have far, far bigger problems
> > than aliased memtypes.
> 
> This wasn't the concern.
> 
> >   And so the fact that TDX private memory is forced WB is
> > interesting only to the guest, not KVM.
> 
> It's just another little quirk in an already complicated solution. They third
> thing we discussed was somehow rejecting or not supporting non-coherent DMA.
> This seemed simpler than that.

Again, huh?  This has _nothing_ to do with non-coherent DMA.  Devices can't DMA
into TDX private memory.
Edgecombe, Rick P May 15, 2024, 8:32 p.m. UTC | #12
On Wed, 2024-05-15 at 12:48 -0700, Sean Christopherson wrote:
> > It's just another little quirk in an already complicated solution. They
> > third
> > thing we discussed was somehow rejecting or not supporting non-coherent DMA.
> > This seemed simpler than that.
> 
> Again, huh?  This has _nothing_ to do with non-coherent DMA.  Devices can't
> DMA
> into TDX private memory.

Hmm... I'm confused how you are confused... :)

For normal VMs (after that change you linked), guests will honor guest PAT on
newer HW. On older HW it will only honor guest PAT if non-coherent DMA is
attached.

For TDX we can't honor guest PAT for private memory. So we can either have:
1. Have shared honor PAT and private not.
2. Have private and shared both not honor PAT and be consistent. Unless non-
coherent DMA is attached. In that case KVM could zap shared only and switch to
1.

The only benefit of 2 is that in normal conditions the guest will have
consistent cache behavior between private and shared.

FWIW, there was at one time a use for private uncacheable memory proposed. It
was for keeping non-performance sensitive secret data protected from speculative
access. (not for TDX, a general kernel thing). This isn't a real thing today,
but it's an example of how the private/shared split is quirky, when you ask "do
TDs support PAT?".

1 is a little quirky, but 2 is too complex and also quirky. 1 is the best
option.
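To make the trade-off between the two options concrete, here is a toy model of the memtype decision being debated (purely illustrative; the names `memtype_option1`/`memtype_option2` are made up, and KVM's real logic lives elsewhere):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the two policies discussed above; all names are
 * illustrative, not KVM's actual API. */
enum memtype { MT_FORCED_WB, MT_HONOR_GUEST_PAT };

/* Option 1: shared memory honors guest PAT, private is always WB. */
static enum memtype memtype_option1(bool is_private, bool noncoherent_dma)
{
	(void)noncoherent_dma;	/* irrelevant under option 1 */
	return is_private ? MT_FORCED_WB : MT_HONOR_GUEST_PAT;
}

/* Option 2: both halves are forced WB for consistency, until non-coherent
 * DMA is attached; then shared would have to be zapped and re-faulted,
 * falling back to option 1's behavior. */
static enum memtype memtype_option2(bool is_private, bool noncoherent_dma)
{
	if (!noncoherent_dma)
		return MT_FORCED_WB;
	return memtype_option1(is_private, noncoherent_dma);
}
```

Option 2's extra state transition (zap shared when DMA attaches) is what makes it more complex for the same end state, which is the argument for option 1.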


If it's obvious we can trim down the log. There was a bit of hand wringing on
this one, so seemed relevant to discussion. The other point was to describe why
we don't need to support kvm_zap_gfn_range(). I think that point is worth
reviewing. The KVM_BUG_ON() is not super critical, so we could even drop the
patch if it's all settled.
Huang, Kai May 15, 2024, 10:17 p.m. UTC | #13
On 16/05/2024 4:22 am, Isaku Yamahata wrote:
> On Wed, May 15, 2024 at 08:34:37AM -0700,
> Sean Christopherson <seanjc@google.com> wrote:
> 
>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>> index d5cf5b15a10e..808805b3478d 100644
>>> --- a/arch/x86/kvm/mmu/mmu.c
>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>> @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>>>   
>>>   	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>>>   
>>> -	if (tdp_mmu_enabled)
>>> +	if (tdp_mmu_enabled) {
>>> +		/*
>>> +		 * kvm_zap_gfn_range() is used when MTRR or PAT memory
>>> +		 * type was changed.  TDX can't handle zapping the private
>>> +		 * mapping, but it's ok because KVM doesn't support either of
>>> +		 * those features for TDX. In case a new caller appears, BUG
>>> +		 * the VM if it's called for solutions with private aliases.
>>> +		 */
>>> +		KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
>>
>> Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX".  Using a
>> generic name quite obviously doesn't prevent TDX details from bleeding into common
>> code, and dancing around things just makes it all unnecessarily confusing.
>>
>> If we can't avoid bleeding TDX details into common code, my vote is to bite the
>> bullet and simply check vm_type.
> 
> TDX has several aspects related to the TDP MMU.
> 1) Based on the faulting GPA, determine which KVM page table to walk.
>     (private-vs-shared)
> 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct memory
>     load/store.  TDP MMU needs hooks for it.
> 3) The tables must be zapped from the leaf. not the root or the middle.
> 
> For 1) and 2), what about something like this?  TDX backend code will set
> kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask() only
> for address conversion (shared<->private).
> 
> For 1), maybe we can add struct kvm_page_fault.walk_mirrored_pt
>          (or whatever preferable name)?
> 
> For 3), flag of memslot handles it.
> 
> ---
>   arch/x86/include/asm/kvm_host.h | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index aabf1648a56a..218b575d24bd 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1289,6 +1289,7 @@ struct kvm_arch {
>   	u8 vm_type;
>   	bool has_private_mem;
>   	bool has_protected_state;
> +	bool has_mirrored_pt;
>   	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>   	struct list_head active_mmu_pages;
>   	struct list_head zapped_obsolete_pages;
> @@ -2171,8 +2172,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>   
>   #ifdef CONFIG_KVM_PRIVATE_MEM
>   #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> +#define kvm_arch_has_mirrored_pt(kvm) ((kvm)->arch.has_mirrored_pt)
>   #else
>   #define kvm_arch_has_private_mem(kvm) false
> +#define kvm_arch_has_mirrored_pt(kvm) false
>   #endif
>   
>   static inline u16 kvm_read_ldt(void)

I think this 'has_mirrored_pt' (or a better name) is better, because it 
clearly conveys it is for the "page table", but not the actual page that 
any page table entry maps to.

AFAICT we need to split the concept of "private page table itself" and 
the "memory type of the actual GFN".

E.g., both SEV-SNP and TDX have the concept of "private memory" (obviously), 
but I was told only TDX uses a dedicated private page table which isn't 
directly accessible to KVM.  SEV-SNP on the other hand just uses a normal 
page table plus an additional HW-managed table to ensure security.

In other words, I think we should decide whether to invoke the TDP MMU 
callback for a private mapping (the page table itself may just be a normal 
one) depending on fault->is_private, not on whether the page table 
is private:

	if (fault->is_private && kvm_x86_ops->set_private_spte)
		kvm_x86_set_private_spte(...);
	else
		tdp_mmu_set_spte_atomic(...);

And 'has_mirrored_pt' should only be used to select the root of the 
page table that we want to operate on.

This also gives a chance that if anything special needs to be 
done for pages allocated for the "non-leaf" middle page table for 
SEV-SNP, it can just fit in.

Of course, if 'has_mirrored_pt' is true, we can assume there's a special 
way to operate on it, i.e., kvm_x86_ops->set_private_spte etc. must be valid.
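The dispatch described above can be sketched as a self-contained toy (the `x86_ops` struct and `set_private_spte` stand in for the real kvm_x86_ops hook; the counters exist only to make the behavior observable):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the relevant part of kvm_x86_ops. */
struct x86_ops {
	int (*set_private_spte)(unsigned long gfn); /* NULL when the backend has none */
};

static int private_calls, normal_calls;

static int tdx_set_private_spte(unsigned long gfn)
{
	(void)gfn;
	private_calls++;	/* would go through a SEAMCALL on real TDX */
	return 0;
}

static int tdp_mmu_set_spte(unsigned long gfn)
{
	(void)gfn;
	normal_calls++;		/* plain memory write to the SPTE */
	return 0;
}

/* Call the backend hook only when the fault is private AND the backend
 * provides one; otherwise write the SPTE directly. */
static int set_spte(const struct x86_ops *ops, bool fault_is_private,
		    unsigned long gfn)
{
	if (fault_is_private && ops->set_private_spte)
		return ops->set_private_spte(gfn);
	return tdp_mmu_set_spte(gfn);
}
```

A backend like SEV-SNP that doesn't need a private hook simply leaves the pointer NULL and falls through to the direct write, which is the flexibility being argued for.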
Edgecombe, Rick P May 15, 2024, 11:14 p.m. UTC | #14
On Thu, 2024-05-16 at 10:17 +1200, Huang, Kai wrote:
> > TDX has several aspects related to the TDP MMU.
> > 1) Based on the faulting GPA, determine which KVM page table to walk.
> >      (private-vs-shared)
> > 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct
> > memory
> >      load/store.  TDP MMU needs hooks for it.
> > 3) The tables must be zapped from the leaf. not the root or the middle.
> > 
> > For 1) and 2), what about something like this?  TDX backend code will set
> > kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask()
> > only
> > for address conversion (shared<->private).

1 and 2 are not the same as "mirrored" though. You could have a design that
mirrors half of the EPT and doesn't track it with separate roots. In fact, 1
might be just a KVM design choice, even for TDX.

What we are really trying to do here is not put "is tdx" logic in the generic
code. We could rely on the fact that TDX is the only one with mirrored TDP, but
that is kind of what we are already doing with kvm_gfn_shared_mask().

How about we do helpers for each of your bullets, and they all just check:
vm_type == KVM_X86_TDX_VM

So like:
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index a578ea09dfb3..c0beed5b090a 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -355,4 +355,19 @@ static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
        return mask && !(gpa_to_gfn(gpa) & mask);
 }
 
+static inline bool kvm_has_mirrored_tdp(struct kvm *kvm)
+{
+       return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool kvm_has_private_root(struct kvm *kvm)
+{
+       return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
+static inline bool kvm_zap_leafs_only(struct kvm *kvm)
+{
+       return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
 #endif


This is similar to what Sean proposed earlier, we just didn't get that far:
https://lore.kernel.org/kvm/ZhSYEVCHqSOpVKMh@google.com/
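A toy version of how a generic-code call site could use helpers like the ones above (the vm_type value and the zap routine are illustrative only, not the real kvm_zap_gfn_range()):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative vm_type values; the real constants live in the KVM UAPI. */
#define KVM_X86_DEFAULT_VM	0
#define KVM_X86_TDX_VM		5

struct kvm { int vm_type; };

static bool kvm_has_mirrored_tdp(const struct kvm *kvm)
{
	return kvm->vm_type == KVM_X86_TDX_VM;
}

static bool kvm_zap_leafs_only(const struct kvm *kvm)
{
	return kvm->vm_type == KVM_X86_TDX_VM;
}

/* Toy zap entry point: returns false on the modeled KVM_BUG_ON() path,
 * i.e. when a VM with mirrored TDP must never reach this function. */
static bool zap_gfn_range(const struct kvm *kvm)
{
	if (kvm_has_mirrored_tdp(kvm))
		return false;
	return true;
}
```

The point of the helpers is that each call site names the *property* it depends on, while the "is this TDX" check lives in exactly one place per property.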


> > 
> > For 1), maybe we can add struct kvm_page_fault.walk_mirrored_pt
> >           (or whatever preferable name)?
> > 
> > For 3), flag of memslot handles it.
> > 
> > ---
> >    arch/x86/include/asm/kvm_host.h | 3 +++
> >    1 file changed, 3 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index aabf1648a56a..218b575d24bd 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1289,6 +1289,7 @@ struct kvm_arch {
> >         u8 vm_type;
> >         bool has_private_mem;
> >         bool has_protected_state;
> > +       bool has_mirrored_pt;
> >         struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> >         struct list_head active_mmu_pages;
> >         struct list_head zapped_obsolete_pages;
> > @@ -2171,8 +2172,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> >    
> >    #ifdef CONFIG_KVM_PRIVATE_MEM
> >    #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> > +#define kvm_arch_has_mirrored_pt(kvm) ((kvm)->arch.has_mirrored_pt)
> >    #else
> >    #define kvm_arch_has_private_mem(kvm) false
> > +#define kvm_arch_has_mirrored_pt(kvm) false
> >    #endif
> >    
> >    static inline u16 kvm_read_ldt(void)
> 
> I think this 'has_mirrored_pt' (or a better name) is better, because it 
> clearly conveys it is for the "page table", but not the actual page that 
> any page table entry maps to.
> 
> AFAICT we need to split the concept of "private page table itself" and 
> the "memory type of the actual GFN".
> 
> E.g., both SEV-SNP and TDX have the concept of "private memory" (obviously), 
> but I was told only TDX uses a dedicated private page table which isn't 
> directly accessible to KVM.  SEV-SNP on the other hand just uses a normal 
> page table plus an additional HW-managed table to ensure security.
> 
> In other words, I think we should decide whether to invoke TDP MMU 
> callback for private mapping (the page table itself may just be normal 
> one) depending on the fault->is_private, but not whether the page table 
> is private:
> 
>         if (fault->is_private && kvm_x86_ops->set_private_spte)
>                 kvm_x86_set_private_spte(...);
>         else
>                 tdp_mmu_set_spte_atomic(...);
> 
> And the 'has_mirrored_pt' should be only used to select the root of the 
> page table that we want to operate on.
> 
> This also gives a chance that if there's anything special needs to be 
> done for page allocated for the "non-leaf" middle page table for 
> SEV-SNP, it can just fit.
> 
> Of course, if 'has_mirrored_pt' is true, we can assume there's a special 
> way to operate it, i.e., kvm_x86_ops->set_private_spte etc must be valid.

It's a good point that we are mixing up "private" in the code from SNP's
perspective.
Huang, Kai May 15, 2024, 11:14 p.m. UTC | #15
On 16/05/2024 3:22 am, Edgecombe, Rick P wrote:
> On Wed, 2024-05-15 at 13:27 +0000, Huang, Kai wrote:
>>
>> kvm_zap_gfn_range() looks a generic function.  I think it makes more sense
>> to let the callers to explicitly check whether VM is TDX guest and do the
>> KVM_BUG_ON()?
> 
> Other TDX changes will prevent this function getting called. So basically like
> you are suggesting. This change is to catch any new cases that pop up, which we
> can't do at the caller.

But I think we need to see whether calling kvm_zap_gfn_range() is legal 
or not for a TDX guest case by case, rather than having a universal rule 
that it cannot be called for a TDX guest, right?
Sean Christopherson May 15, 2024, 11:26 p.m. UTC | #16
On Wed, May 15, 2024, Rick P Edgecombe wrote:
> On Wed, 2024-05-15 at 12:48 -0700, Sean Christopherson wrote:
> > > It's just another little quirk in an already complicated solution. The
> > > third
> > > thing we discussed was somehow rejecting or not supporting non-coherent DMA.
> > > This seemed simpler than that.
> > 
> > Again, huh?  This has _nothing_ to do with non-coherent DMA.  Devices can't
> > DMA
> > into TDX private memory.
> 
> Hmm... I'm confused how you are confused... :)
> 
> For normal VMs (after that change you linked), guests will honor guest PAT on
> newer HW. On older HW it will only honor guest PAT if non-coherent DMA is
> attached.
> 
> For TDX we can't honor guest PAT for private memory. So we can either have:
> 1. Have shared honor PAT and private not.
> 2. Have private and shared both not honor PAT and be consistent. Unless non-
> coherent DMA is attached. In that case KVM could zap shared only and switch to
> 1.

Oh good gravy, hell no :-)
Huang, Kai May 15, 2024, 11:38 p.m. UTC | #17
On 16/05/2024 11:14 am, Edgecombe, Rick P wrote:
> On Thu, 2024-05-16 at 10:17 +1200, Huang, Kai wrote:
>>> TDX has several aspects related to the TDP MMU.
>>> 1) Based on the faulting GPA, determine which KVM page table to walk.
>>>       (private-vs-shared)
>>> 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct
>>> memory
>>>       load/store.  TDP MMU needs hooks for it.
>>> 3) The tables must be zapped from the leaf. not the root or the middle.
>>>
>>> For 1) and 2), what about something like this?  TDX backend code will set
>>> kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask()
>>> only
>>> for address conversion (shared<->private).
> 
> 1 and 2 are not the same as "mirrored" though. You could have a design that
> mirrors half of the EPT and doesn't track it with separate roots. In fact, 1
> might be just a KVM design choice, even for TDX.

I am not sure whether I understand this correctly.  If they are not 
tracked with separate roots, it means they use the same page table (root).

So IIUC what you said is to support "mirror PT" at any sub-tree of the 
page table?

That will only complicate things.  I don't think we should consider 
this.  In reality, we only have TDX and SEV-SNP.  We should have a 
simple solution to cover both of them.

> 
> What we are really trying to do here is not put "is tdx" logic in the generic
> code. We could rely on the fact that TDX is the only one with mirrored TDP, but
> that is kind of what we are already doing with kvm_gfn_shared_mask().
> 
> How about we do helpers for each of your bullets, and they all just check:
> vm_type == KVM_X86_TDX_VM
> 
> So like:
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index a578ea09dfb3..c0beed5b090a 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -355,4 +355,19 @@ static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
>          return mask && !(gpa_to_gfn(gpa) & mask);
>   }
>   
> +static inline bool kvm_has_mirrored_tdp(struct kvm *kvm)
> +{
> +       return kvm->arch.vm_type == KVM_X86_TDX_VM;
> +}
> +
> +static inline bool kvm_has_private_root(struct kvm *kvm)
> +{
> +       return kvm->arch.vm_type == KVM_X86_TDX_VM;
> +}

I don't think we need to distinguish the two.

Even if we do this, if I understand you correctly, 
kvm_has_private_root() isn't enough by itself -- theoretically we can have a 
mirror PT at a sub-tree at any level.
Edgecombe, Rick P May 16, 2024, 12:13 a.m. UTC | #18
On Thu, 2024-05-16 at 11:38 +1200, Huang, Kai wrote:
> 
> 
> On 16/05/2024 11:14 am, Edgecombe, Rick P wrote:
> > On Thu, 2024-05-16 at 10:17 +1200, Huang, Kai wrote:
> > > > TDX has several aspects related to the TDP MMU.
> > > > 1) Based on the faulting GPA, determine which KVM page table to walk.
> > > >        (private-vs-shared)
> > > > 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct
> > > > memory
> > > >        load/store.  TDP MMU needs hooks for it.
> > > > 3) The tables must be zapped from the leaf. not the root or the middle.
> > > > 
> > > > For 1) and 2), what about something like this?  TDX backend code will
> > > > set
> > > > kvm->arch.has_mirrored_pt = true; I think we will use
> > > > kvm_gfn_shared_mask()
> > > > only
> > > > for address conversion (shared<->private).
> > 
> > 1 and 2 are not the same as "mirrored" though. You could have a design that
> > mirrors half of the EPT and doesn't track it with separate roots. In fact, 1
> > might be just a KVM design choice, even for TDX.
> 
> I am not sure whether I understand this correctly.  If they are not 
> tracked with separate roots, it means they use the same page table (root).

There are three roots, right? Shared, private and mirrored. Shared and mirrored
don't have to be different roots, but it makes some operations arguably easier
to have it that way.

> 
> So IIUC what you said is to support "mirror PT" at any sub-tree of the 
> page table?
> 
> That will only complicate things.  I don't think we should consider 
> this.  In reality, we only have TDX and SEV-SNP.  We should have a 
> simple solution to cover both of them.

Look at "bool is_private" in kvm_tdp_mmu_map(). Do you see how it switches
between different roots in the iterator? That is one use.

The second use is to decide whether to call out to the x86_ops. It happens via
the role bit in the sp, which is copied from the parent sp role. The root's bit
is set originally via a kvm_gfn_shared_mask() check.

BTW, the role bit is the thing I'm wondering if we really need, because we have
shared_mask. While the shared_mask is used for lots of things today, we
still need it for masking GPAs. Whereas the role bit is only needed to know if
a SP is for private (which we can tell from the GPA).
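The claim that privateness is derivable from the GPA alone can be illustrated with the same check as the kvm_is_private_gpa() snippet quoted earlier in the thread (the shared-bit position here is an assumed example, not a fixed architectural value):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12

static uint64_t gpa_to_gfn(uint64_t gpa)
{
	return gpa >> PAGE_SHIFT;
}

/* Mirrors the logic of the mmu.h hunk quoted above: a GPA is private
 * when the VM has a shared mask and the shared bit is clear in its GFN.
 * A zero mask means "no private memory at all". */
static bool is_private_gpa(uint64_t gfn_shared_mask, uint64_t gpa)
{
	return gfn_shared_mask && !(gpa_to_gfn(gpa) & gfn_shared_mask);
}
```

So any code holding the GPA and the per-VM shared mask can recover "is this SP for private" without a role bit, which is the redundancy being questioned.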
Isaku Yamahata May 16, 2024, 12:15 a.m. UTC | #19
On Thu, May 16, 2024 at 10:17:50AM +1200,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On 16/05/2024 4:22 am, Isaku Yamahata wrote:
> > On Wed, May 15, 2024 at 08:34:37AM -0700,
> > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index d5cf5b15a10e..808805b3478d 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > > >   	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> > > > -	if (tdp_mmu_enabled)
> > > > +	if (tdp_mmu_enabled) {
> > > > +		/*
> > > > +		 * kvm_zap_gfn_range() is used when MTRR or PAT memory
> > > > +		 * type was changed.  TDX can't handle zapping the private
> > > > +		 * mapping, but it's ok because KVM doesn't support either of
> > > > +		 * those features for TDX. In case a new caller appears, BUG
> > > > +		 * the VM if it's called for solutions with private aliases.
> > > > +		 */
> > > > +		KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
> > > 
> > > Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX".  Using a
> > > generic name quite obviously doesn't prevent TDX details from bleeding into common
> > > code, and dancing around things just makes it all unnecessarily confusing.
> > > 
> > > If we can't avoid bleeding TDX details into common code, my vote is to bite the
> > > bullet and simply check vm_type.
> > 
> > TDX has several aspects related to the TDP MMU.
> > 1) Based on the faulting GPA, determine which KVM page table to walk.
> >     (private-vs-shared)
> > 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct memory
> >     load/store.  TDP MMU needs hooks for it.
> > 3) The tables must be zapped from the leaf. not the root or the middle.
> > 
> > For 1) and 2), what about something like this?  TDX backend code will set
> > kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask() only
> > for address conversion (shared<->private).
> > 
> > For 1), maybe we can add struct kvm_page_fault.walk_mirrored_pt
> >          (or whatever preferable name)?
> > 
> > For 3), flag of memslot handles it.
> > 
> > ---
> >   arch/x86/include/asm/kvm_host.h | 3 +++
> >   1 file changed, 3 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index aabf1648a56a..218b575d24bd 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1289,6 +1289,7 @@ struct kvm_arch {
> >   	u8 vm_type;
> >   	bool has_private_mem;
> >   	bool has_protected_state;
> > +	bool has_mirrored_pt;
> >   	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> >   	struct list_head active_mmu_pages;
> >   	struct list_head zapped_obsolete_pages;
> > @@ -2171,8 +2172,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> >   #ifdef CONFIG_KVM_PRIVATE_MEM
> >   #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> > +#define kvm_arch_has_mirrored_pt(kvm) ((kvm)->arch.has_mirrored_pt)
> >   #else
> >   #define kvm_arch_has_private_mem(kvm) false
> > +#define kvm_arch_has_mirrored_pt(kvm) false
> >   #endif
> >   static inline u16 kvm_read_ldt(void)
> 
> I think this 'has_mirrored_pt' (or a better name) is better, because it
> clearly conveys it is for the "page table", but not the actual page that any
> page table entry maps to.
> 
> AFAICT we need to split the concept of "private page table itself" and the
> "memory type of the actual GFN".
> 
> E.g., both SEV-SNP and TDX have the concept of "private memory" (obviously), but
> I was told only TDX uses a dedicated private page table which isn't directly
> accessible to KVM.  SEV-SNP on the other hand just uses a normal page table plus
> an additional HW-managed table to ensure security.

kvm_mmu_page_role.is_private is not a good name now. Probably is_mirrored_pt or
need_callback or whatever makes sense.


> In other words, I think we should decide whether to invoke TDP MMU callback
> for private mapping (the page table itself may just be normal one) depending
> on the fault->is_private, but not whether the page table is private:
> 
> 	if (fault->is_private && kvm_x86_ops->set_private_spte)
> 		kvm_x86_set_private_spte(...);
> 	else
> 		tdp_mmu_set_spte_atomic(...);

This doesn't work for two reasons.

- We would need to pass 'struct kvm_page_fault fault' deep down the call chain
  only for this.  We could change the code in such a way.

- We don't have a 'struct kvm_page_fault fault' for the zapping case.
  We could create a dummy one and pass it around.
  
Essentially the issue is how to pass down is_private or stash the info
somewhere or determine it somehow.  The options I can think of are:

- Pass around fault:
  Con: fault isn't passed down 
  Con: Create fake fault for zapping case

- Stash it in struct tdp_iter and pass around iter:
  Pro: work for zapping case
  Con: we need to change the code to pass down tdp_iter

- Pass around is_private (or mirrored_pt or whatever):
  Pro: Don't need to add member to some structure
  Con: We need to pass it around still.

- Stash it in kvm_mmu_page:
  The patch series uses kvm_mmu_page.role.
  Pro: We don't need to pass around because we know struct kvm_mmu_page
  Con: Need to twist root page allocation

- Use gfn. kvm_is_private_gfn(kvm, gfn):
  Con: The use of gfn is confusing.  It's too TDX specific.
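The "stash it in struct tdp_iter" option, for example, could look roughly like this (field and function names hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy iterator: record at walk-start time whether this walk operates on
 * the mirrored (private) page table, so leaf-level operations can decide
 * whether to call the private-SPTE hook without any extra argument. */
struct tdp_iter {
	uint64_t gfn;
	bool mirrored_pt;
};

static void tdp_iter_start(struct tdp_iter *iter, uint64_t gfn, bool mirrored)
{
	iter->gfn = gfn;
	iter->mirrored_pt = mirrored;
}

static bool iter_needs_private_hook(const struct tdp_iter *iter)
{
	return iter->mirrored_pt;
}
```

This is the variant that also covers the zapping case, since zaps already pass the iterator down.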


> And the 'has_mirrored_pt' should be only used to select the root of the page
> table that we want to operate on.

We can add one more bool to struct kvm_page_fault.follow_mirrored_pt or
something to represent it.  We can initialize it in __kvm_mmu_do_page_fault().

.follow_mirrored_pt = kvm->arch.has_mirrored_pt && kvm_is_private_gpa(gpa);
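Putting the two pieces together, the proposed initialization could be sketched as follows (names are as proposed in this thread or assumed for illustration; this is not actual KVM code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct kvm {
	bool has_mirrored_pt;		/* set by the TDX backend */
	uint64_t gfn_shared_mask;	/* 0 for VMs without a shared bit */
};

static bool kvm_is_private_gpa(const struct kvm *kvm, uint64_t gpa)
{
	uint64_t mask = kvm->gfn_shared_mask;

	return mask && !((gpa >> 12) & mask);
}

/* Sketch of what __kvm_mmu_do_page_fault() would compute: walk the
 * mirrored page table only for VMs that have one, and only for faults
 * on private GPAs. */
static bool follow_mirrored_pt(const struct kvm *kvm, uint64_t gpa)
{
	return kvm->has_mirrored_pt && kvm_is_private_gpa(kvm, gpa);
}
```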


> This also gives a chance that if there's anything special needs to be done
> for page allocated for the "non-leaf" middle page table for SEV-SNP, it can
> just fit.

Can you please elaborate on this?
Isaku Yamahata May 16, 2024, 12:27 a.m. UTC | #20
On Thu, May 16, 2024 at 12:13:44AM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:

> On Thu, 2024-05-16 at 11:38 +1200, Huang, Kai wrote:
> > On 16/05/2024 11:14 am, Edgecombe, Rick P wrote:
> > > On Thu, 2024-05-16 at 10:17 +1200, Huang, Kai wrote:
> > > > > TDX has several aspects related to the TDP MMU.
> > > > > 1) Based on the faulting GPA, determine which KVM page table to walk.
> > > > >        (private-vs-shared)
> > > > > 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct
> > > > > memory
> > > > >        load/store.  TDP MMU needs hooks for it.
> > > > > 3) The tables must be zapped from the leaf. not the root or the middle.
> > > > > 
> > > > > For 1) and 2), what about something like this?  TDX backend code will
> > > > > set
> > > > > kvm->arch.has_mirrored_pt = true; I think we will use
> > > > > kvm_gfn_shared_mask()
> > > > > only
> > > > > for address conversion (shared<->private).
> > > 
> > > 1 and 2 are not the same as "mirrored" though. You could have a design that
> > > mirrors half of the EPT and doesn't track it with separate roots. In fact, 1
> > > might be just a KVM design choice, even for TDX.
> > 
> > I am not sure whether I understand this correctly.  If they are not 
> > tracked with separate roots, it means they use the same page table (root).
> 
> There are three roots, right? Shared, private and mirrored. Shared and mirrored
> don't have to be different roots, but it makes some operations arguably easier
> to have it that way.

Do you have something like KVM_X86_SW_PROTECTED_VM with mirrored PT in mind?
or a TDX-specific thing?



> > So IIUC what you said is to support "mirror PT" at any sub-tree of the 
> > page table?
> > 
> > That will only complicate things.  I don't think we should consider 
> > this.  In reality, we only have TDX and SEV-SNP.  We should have a 
> > simple solution to cover both of them.
> 
> Look at "bool is_private" in kvm_tdp_mmu_map(). Do you see how it switches
> between different roots in the iterator? That is one use.
> 
> The second use is to decide whether to call out to the x86_ops. It happens via
> the role bit in the sp, which is copied from the parent sp role. The root's bit
> is set originally via a kvm_gfn_shared_mask() check.
> 
> BTW, the role bit is the thing I'm wondering if we really need, because we have
> shared_mask. While the shared_mask is used for lots of things today, we
> still need it for masking GPAs. Whereas the role bit is only needed to know if
> a SP is for private (which we can tell from the GPA).

I started the discussion at [1] for it.

[1] https://lore.kernel.org/kvm/20240516001530.GG168153@ls.amr.corp.intel.com/
Edgecombe, Rick P May 16, 2024, 12:52 a.m. UTC | #21
On Wed, 2024-05-15 at 17:15 -0700, Isaku Yamahata wrote:
> 
> kvm_mmu_page_role.is_private is not good name now. Probably is_mirrored_pt or
> need_callback or whatever makes sense.

Name seems good to me. It's a better reason to have a role bit, as there could
be a shared_bit without mirroring.
Huang, Kai May 16, 2024, 1:11 a.m. UTC | #22
> 
> BTW, the role bit is the thing I'm wondering if we really need, because we have
> shared_mask. While the shared_mask is used for lots of things today, we
> still need it for masking GPAs. Whereas the role bit is only needed to know if
> a SP is for private (which we can tell from the GPA).

Yeah we can have a second thought on whether sp.role.private is 
necessary.  It is useful in the shadow MMU (which we originally used to 
support in the first place), but may not be necessary for the TDP MMU.
Huang, Kai May 16, 2024, 1:21 a.m. UTC | #23
On 16/05/2024 12:15 pm, Isaku Yamahata wrote:
> On Thu, May 16, 2024 at 10:17:50AM +1200,
> "Huang, Kai" <kai.huang@intel.com> wrote:
> 
>> On 16/05/2024 4:22 am, Isaku Yamahata wrote:
>>> On Wed, May 15, 2024 at 08:34:37AM -0700,
>>> Sean Christopherson <seanjc@google.com> wrote:
>>>
>>>>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>>>>> index d5cf5b15a10e..808805b3478d 100644
>>>>> --- a/arch/x86/kvm/mmu/mmu.c
>>>>> +++ b/arch/x86/kvm/mmu/mmu.c
>>>>> @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>>>>>    	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
>>>>> -	if (tdp_mmu_enabled)
>>>>> +	if (tdp_mmu_enabled) {
>>>>> +		/*
>>>>> +		 * kvm_zap_gfn_range() is used when MTRR or PAT memory
>>>>> +		 * type was changed.  TDX can't handle zapping the private
>>>>> +		 * mapping, but it's ok because KVM doesn't support either of
>>>>> +		 * those features for TDX. In case a new caller appears, BUG
>>>>> +		 * the VM if it's called for solutions with private aliases.
>>>>> +		 */
>>>>> +		KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
>>>>
>>>> Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX".  Using a
>>>> generic name quite obviously doesn't prevent TDX details from bleeding into common
>>>> code, and dancing around things just makes it all unnecessarily confusing.
>>>>
>>>> If we can't avoid bleeding TDX details into common code, my vote is to bite the
>>>> bullet and simply check vm_type.
>>>
>>> TDX has several aspects related to the TDP MMU.
>>> 1) Based on the faulting GPA, determine which KVM page table to walk.
>>>      (private-vs-shared)
>>> 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct memory
>>>      load/store.  TDP MMU needs hooks for it.
>>> 3) The tables must be zapped from the leaf. not the root or the middle.
>>>
>>> For 1) and 2), what about something like this?  TDX backend code will set
>>> kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask() only
>>> for address conversion (shared<->private).
>>>
>>> For 1), maybe we can add struct kvm_page_fault.walk_mirrored_pt
>>>           (or whatever preferable name)?
>>>
>>> For 3), flag of memslot handles it.
>>>
>>> ---
>>>    arch/x86/include/asm/kvm_host.h | 3 +++
>>>    1 file changed, 3 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index aabf1648a56a..218b575d24bd 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -1289,6 +1289,7 @@ struct kvm_arch {
>>>    	u8 vm_type;
>>>    	bool has_private_mem;
>>>    	bool has_protected_state;
>>> +	bool has_mirrored_pt;
>>>    	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>>>    	struct list_head active_mmu_pages;
>>>    	struct list_head zapped_obsolete_pages;
>>> @@ -2171,8 +2172,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
>>>    #ifdef CONFIG_KVM_PRIVATE_MEM
>>>    #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
>>> +#define kvm_arch_has_mirrored_pt(kvm) ((kvm)->arch.has_mirrored_pt)
>>>    #else
>>>    #define kvm_arch_has_private_mem(kvm) false
>>> +#define kvm_arch_has_mirrored_pt(kvm) false
>>>    #endif
>>>    static inline u16 kvm_read_ldt(void)
>>
>> I think this 'has_mirrored_pt' (or a better name) is better, because it
>> clearly conveys it is for the "page table", but not the actual page that any
>> page table entry maps to.
>>
>> AFAICT we need to split the concept of "private page table itself" and the
>> "memory type of the actual GFN".
>>
>> E.g., both SEV-SNP and TDX have the concept of "private memory" (obviously), but
>> I was told only TDX uses a dedicated private page table which isn't directly
>> accessible to KVM.  SEV-SNP on the other hand just uses a normal page table plus
>> an additional HW-managed table to ensure security.
> 
> kvm_mmu_page_role.is_private is not a good name now. Probably is_mirrored_pt or
> need_callback or whatever makes sense.
> 
> 
>> In other words, I think we should decide whether to invoke TDP MMU callback
>> for private mapping (the page table itself may just be normal one) depending
>> on the fault->is_private, but not whether the page table is private:
>>
>> 	if (fault->is_private && kvm_x86_ops->set_private_spte)
>> 		kvm_x86_set_private_spte(...);
>> 	else
>> 		tdp_mmu_set_spte_atomic(...);
> 
> This doesn't work for two reasons.
> 
> - We would need to pass 'struct kvm_page_fault fault' deep down the call
>    chain only for this.  We could change the code in such a way.
> 
> - We don't have a 'struct kvm_page_fault fault' for the zapping case.
>    We could create a dummy one and pass it around.

For both of the above, we don't necessarily need the whole 'kvm_page_fault', we 
just need:

  1) the GFN
  2) whether it is private (points to private memory, to be precise)
  3) whether to use a separate private page table.

>    
> Essentially the issue is how to pass down is_private or stash the info
> somewhere or determine it somehow.  Options I think of are
> 
> - Pass around fault:
>    Con: fault isn't passed down
>    Con: Create fake fault for zapping case
> - Stash it in struct tdp_iter and pass around iter:
>    Pro: work for zapping case
>    Con: we need to change the code to pass down tdp_iter
> - Pass around is_private (or mirrored_pt or whatever):
>    Pro: Don't need to add member to some structure
>    Con: We need to pass it around still.
> - Stash it in kvm_mmu_page:
>    The patch series uses kvm_mmu_page.role.
>    Pro: We don't need to pass around because we know struct kvm_mmu_page
>    Con: Need to twist root page allocation

I don't think using kvm_mmu_page.role is correct.

If kvm_mmu_page.role is private, we definitely can assume the faulting 
address is private; but otherwise the address can be either private or shared.

> 
> - Use gfn. kvm_is_private_gfn(kvm, gfn):
>    Con: The use of gfn is confusing.  It's too TDX specific.
> 
> 
>> And the 'has_mirrored_pt' should be only used to select the root of the page
>> table that we want to operate on.
> 
> We can add one more bool to struct kvm_page_fault.follow_mirrored_pt or
> something to represent it.  We can initialize it in __kvm_mmu_do_page_fault().
> 
> .follow_mirrored_pt = kvm->arch.has_mirrored_pt && kvm_is_private_gpa(gpa);
> 
> 
>> This also gives a chance that if there's anything special that needs to be
>> done for pages allocated for the "non-leaf" middle page tables for SEV-SNP,
>> it can just fit.
> 
> Can you please elaborate on this?

I meant SEV-SNP may have its own version of link_private_spt().

I haven't looked into it, and it may not be needed from a hardware 
perspective, but providing such a chance certainly doesn't hurt and is 
more flexible IMHO.
Isaku Yamahata May 16, 2024, 5:27 p.m. UTC | #24
On Thu, May 16, 2024 at 01:21:40PM +1200,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On 16/05/2024 12:15 pm, Isaku Yamahata wrote:
> > On Thu, May 16, 2024 at 10:17:50AM +1200,
> > "Huang, Kai" <kai.huang@intel.com> wrote:
> > 
> > > On 16/05/2024 4:22 am, Isaku Yamahata wrote:
> > > > On Wed, May 15, 2024 at 08:34:37AM -0700,
> > > > Sean Christopherson <seanjc@google.com> wrote:
> > > > 
> > > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > > index d5cf5b15a10e..808805b3478d 100644
> > > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > > > @@ -6528,8 +6528,17 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > > > > >    	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
> > > > > > -	if (tdp_mmu_enabled)
> > > > > > +	if (tdp_mmu_enabled) {
> > > > > > +		/*
> > > > > > +		 * kvm_zap_gfn_range() is used when MTRR or PAT memory
> > > > > > +		 * type was changed.  TDX can't handle zapping the private
> > > > > > +		 * mapping, but it's ok because KVM doesn't support either of
> > > > > > +		 * those features for TDX. In case a new caller appears, BUG
> > > > > > +		 * the VM if it's called for solutions with private aliases.
> > > > > > +		 */
> > > > > > +		KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
> > > > > 
> > > > > Please stop using kvm_gfn_shared_mask() as a proxy for "is this TDX".  Using a
> > > > > generic name quite obviously doesn't prevent TDX details from bleeding into common
> > > > > code, and dancing around things just makes it all unnecessarily confusing.
> > > > > 
> > > > > If we can't avoid bleeding TDX details into common code, my vote is to bite the
> > > > > bullet and simply check vm_type.
> > > > 
> > > > TDX has several aspects related to the TDP MMU.
> > > > 1) Based on the faulting GPA, determine which KVM page table to walk.
> > > >      (private-vs-shared)
> > > > 2) Need to call TDX SEAMCALL to operate on Secure-EPT instead of direct memory
> > > >      load/store.  TDP MMU needs hooks for it.
> > > > 3) The tables must be zapped from the leaf, not the root or the middle.
> > > > 
> > > > For 1) and 2), what about something like this?  TDX backend code will set
> > > > kvm->arch.has_mirrored_pt = true; I think we will use kvm_gfn_shared_mask() only
> > > > for address conversion (shared<->private).
> > > > 
> > > > For 1), maybe we can add struct kvm_page_fault.walk_mirrored_pt
> > > >           (or whatever name is preferable)?
> > > > 
> > > > For 3), a memslot flag handles it.
> > > > 
> > > > ---
> > > >    arch/x86/include/asm/kvm_host.h | 3 +++
> > > >    1 file changed, 3 insertions(+)
> > > > 
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index aabf1648a56a..218b575d24bd 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1289,6 +1289,7 @@ struct kvm_arch {
> > > >    	u8 vm_type;
> > > >    	bool has_private_mem;
> > > >    	bool has_protected_state;
> > > > +	bool has_mirrored_pt;
> > > >    	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > > >    	struct list_head active_mmu_pages;
> > > >    	struct list_head zapped_obsolete_pages;
> > > > @@ -2171,8 +2172,10 @@ void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level,
> > > >    #ifdef CONFIG_KVM_PRIVATE_MEM
> > > >    #define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem)
> > > > +#define kvm_arch_has_mirrored_pt(kvm) ((kvm)->arch.has_mirrored_pt)
> > > >    #else
> > > >    #define kvm_arch_has_private_mem(kvm) false
> > > > +#define kvm_arch_has_mirrored_pt(kvm) false
> > > >    #endif
> > > >    static inline u16 kvm_read_ldt(void)
> > > 
> > > I think this 'has_mirrored_pt' (or a better name) is better, because it
> > > clearly conveys it is for the "page table", but not the actual page that any
> > > page table entry maps to.
> > > 
> > > AFAICT we need to split the concept of "private page table itself" and the
> > > "memory type of the actual GFN".
> > > 
> > > E.g., both SEV-SNP and TDX have the concept of "private memory" (obviously),
> > > but I was told only TDX uses a dedicated private page table which isn't
> > > directly accessible to KVM.  SEV-SNP, on the other hand, just uses a normal
> > > page table plus an additional HW-managed table to ensure security.
> > 
> > kvm_mmu_page_role.is_private is not a good name now. Probably is_mirrored_pt
> > or need_callback or whatever makes sense.
> > 
> > 
> > > In other words, I think we should decide whether to invoke TDP MMU callback
> > > for private mapping (the page table itself may just be normal one) depending
> > > on the fault->is_private, but not whether the page table is private:
> > > 
> > > 	if (fault->is_private && kvm_x86_ops->set_private_spte)
> > > 		kvm_x86_set_private_spte(...);
> > > 	else
> > > 		tdp_mmu_set_spte_atomic(...);
> > 
> > This doesn't work for two reasons.
> > 
> > - We would need to pass struct kvm_page_fault all the way down just for this.
> >    We could change the code in such a way.
> > 
> > - We don't have a struct kvm_page_fault for the zapping case.
> >    We could create a dummy one and pass it around.
> 
> For both above, we don't necessarily need the whole 'kvm_page_fault', we
> just need:
> 
>  1) GFN
>  2) Whether it is private (points to private memory to be precise)
>  3) Whether to use a separate private page table.

Ok, so you suggest passing around necessary info (if missing) somehow.


> > Essentially the issue is how to pass down is_private or stash the info
> > somewhere or determine it somehow.  Options I think of are
> > 
> > - Pass around fault:
> >    Con: fault isn't passed down
> >    Con: Create fake fault for zapping case
> > - Stash it in struct tdp_iter and pass around iter:
> >    Pro: work for zapping case
> >    Con: we need to change the code to pass down tdp_iter
> > - Pass around is_private (or mirrored_pt or whatever):
> >    Pro: Don't need to add member to some structure
> >    Con: We need to pass it around still.
> > - Stash it in kvm_mmu_page:
> >    The patch series uses kvm_mmu_page.role.
> >    Pro: We don't need to pass around because we know struct kvm_mmu_page
> >    Con: Need to twist root page allocation
> 
> I don't think using kvm_mmu_page.role is correct.
> 
> If kvm_mmu_page.role is private, we definitely can assume the faulting
> address is private; but otherwise the address can be either private or shared.

What do you mean by the last sentence?  For example, do you mean memslot
deletion?  In that case, we need a GPA with the shared bit for the shared PT,
and a GPA without the shared bit for the mirrored/private PT.  Or do you mean
something else?


> > - Use gfn. kvm_is_private_gfn(kvm, gfn):
> >    Con: The use of gfn is confusing.  It's too TDX specific.
> > 
> > 
> > > And the 'has_mirrored_pt' should be only used to select the root of the page
> > > table that we want to operate on.
> > 
> > We can add one more bool to struct kvm_page_fault.follow_mirrored_pt or
> > something to represent it.  We can initialize it in __kvm_mmu_do_page_fault().
> > 
> > .follow_mirrored_pt = kvm->arch.has_mirrored_pt && kvm_is_private_gpa(gpa);
> > 
> > 
> > > This also gives a chance that if there's anything special that needs to be
> > > done for pages allocated for the "non-leaf" middle page tables for SEV-SNP,
> > > it can just fit.
> > 
> > Can you please elaborate on this?
> 
> I meant SEV-SNP may have its own version of link_private_spt().
> 
> I haven't looked into it, and it may not be needed from a hardware
> perspective, but providing such a chance certainly doesn't hurt and is
> more flexible IMHO.

It doesn't need TDP MMU hooks.
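The split the thread converges on (fault->is_private choosing whether the private-SPTE hook fires, has_mirrored_pt only choosing which root to walk) can be sketched as a standalone model. All names below are illustrative stand-ins, not actual KVM symbols:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative model, not kernel code. */
struct model_vm {
	bool has_mirrored_pt;	/* TDX: true; SEV-SNP and normal VMs: false */
};

struct model_fault {
	bool is_private;	 /* the faulting GPA points at private memory */
	bool follow_mirrored_pt; /* walk the mirrored (private) root */
};

/* Mirrors: .follow_mirrored_pt = kvm->arch.has_mirrored_pt && kvm_is_private_gpa(gpa) */
void model_fault_init(struct model_fault *f, const struct model_vm *vm,
		      bool private_gpa)
{
	f->is_private = private_gpa;
	f->follow_mirrored_pt = vm->has_mirrored_pt && private_gpa;
}

/* Root selection is keyed off the mirrored-PT property alone. */
const char *model_pick_root(const struct model_fault *f)
{
	return f->follow_mirrored_pt ? "mirrored_root" : "direct_root";
}

/*
 * Whether the backend hook (e.g. a set_private_spte callback) fires is
 * keyed off the fault being private and the hook existing, not off the
 * page table being mirrored.
 */
bool model_use_private_spte_cb(const struct model_fault *f, bool have_cb)
{
	return f->is_private && have_cb;
}
```

Under this model a private fault on a VM without a mirrored page table (the SEV-SNP case as described above) still takes the callback when one is provided, but walks the normal root.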
Edgecombe, Rick P May 16, 2024, 9:46 p.m. UTC | #25
On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> 
> For lack of a better method currently, use kvm_gfn_shared_mask() to
> determine if private memory cannot be zapped (as in TDX, the only VM type
> that sets it).

Trying to replace kvm_gfn_shared_mask() with something appropriate, I saw that
SNP actually uses this function:
https://lore.kernel.org/kvm/20240501085210.2213060-12-michael.roth@amd.com/

So trying to have a helper that says "The VM can't zap and refault in memory at
will" won't cut it. I guess there would have to be something more specific. I'm
thinking of just dropping this patch instead.
Huang, Kai May 16, 2024, 10:23 p.m. UTC | #26
On 17/05/2024 9:46 am, Edgecombe, Rick P wrote:
> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
>>
>> For lack of a better method currently, use kvm_gfn_shared_mask() to
>> determine if private memory cannot be zapped (as in TDX, the only VM type
>> that sets it).
> 
> Trying to replace kvm_gfn_shared_mask() with something appropriate, I saw that
> SNP actually uses this function:
> https://lore.kernel.org/kvm/20240501085210.2213060-12-michael.roth@amd.com/
> 
> So trying to have a helper that says "The VM can't zap and refault in memory at
> will" won't cut it. I guess there would have to be something more specific. I'm
> thinking of just dropping this patch instead.

Or KVM_BUG_ON() in the callers by explicitly checking whether the VM type 
is TDX, as I mentioned before.

Having such a check in a generic function like this is just dangerous 
and not flexible.

Just my 2 cents, though.
Edgecombe, Rick P May 16, 2024, 10:38 p.m. UTC | #27
On Fri, 2024-05-17 at 10:23 +1200, Huang, Kai wrote:
> On 17/05/2024 9:46 am, Edgecombe, Rick P wrote:
> > On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
> > > 
> > > For lack of a better method currently, use kvm_gfn_shared_mask() to
> > > determine if private memory cannot be zapped (as in TDX, the only VM type
> > > that sets it).
> > 
> > Trying to replace kvm_gfn_shared_mask() with something appropriate, I saw
> > that SNP actually uses this function:
> > https://lore.kernel.org/kvm/20240501085210.2213060-12-michael.roth@amd.com/
> > 
> > So trying to have a helper that says "The VM can't zap and refault in memory
> > at will" won't cut it. I guess there would have to be something more
> > specific. I'm thinking of just dropping this patch instead.
> 
> Or KVM_BUG_ON() in the callers by explicitly checking whether the VM type 
> is TDX, as I mentioned before.
> 
> Having such a check in a generic function like this is just dangerous 
> and not flexible.
> 
> Just my 2 cents, though.

As I said before, the point is to catch new callers. I see how it's a little
wrong to assume the intentions of the callers, but I don't see how it's
dangerous. Can you explain?

But you just reminded me that, yes, we can probably just check the vm_type here:
https://lore.kernel.org/kvm/f64c7da52a849cd9697b944769c200dfa3ee7db7.camel@intel.com/
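A standalone sketch of what a vm_type-based guard could look like; the enum values and helper names below are illustrative stand-ins, not the real KVM definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for KVM's vm_type values. */
enum model_vm_type {
	MODEL_VM_DEFAULT,
	MODEL_VM_SEV_SNP,
	MODEL_VM_TDX,
};

/*
 * TDX is the only VM type that cannot zap a private mapping and fault
 * it back in later, so it is the only type the guard rejects.  SEV-SNP
 * still passes, matching its use of kvm_zap_gfn_range() noted above.
 */
bool model_zap_private_ok(enum model_vm_type type)
{
	return type != MODEL_VM_TDX;
}

/* Models the guarded kvm_zap_gfn_range(): returns false where the
 * kernel would instead KVM_BUG_ON(..., kvm). */
bool model_zap_gfn_range(enum model_vm_type type)
{
	if (!model_zap_private_ok(type))
		return false;	/* KVM_BUG_ON(..., kvm) */
	return true;		/* proceed to kvm_tdp_mmu_zap_leafs() */
}
```

This keeps the check specific to the one VM type with the limitation, rather than inferring it from kvm_gfn_shared_mask().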
Huang, Kai May 16, 2024, 11:16 p.m. UTC | #28
On 17/05/2024 10:38 am, Edgecombe, Rick P wrote:
> On Fri, 2024-05-17 at 10:23 +1200, Huang, Kai wrote:
>> On 17/05/2024 9:46 am, Edgecombe, Rick P wrote:
>>> On Tue, 2024-05-14 at 17:59 -0700, Rick Edgecombe wrote:
>>>>
>>>> For lack of a better method currently, use kvm_gfn_shared_mask() to
>>>> determine if private memory cannot be zapped (as in TDX, the only VM type
>>>> that sets it).
>>>
>>> Trying to replace kvm_gfn_shared_mask() with something appropriate, I saw
>>> that SNP actually uses this function:
>>> https://lore.kernel.org/kvm/20240501085210.2213060-12-michael.roth@amd.com/
>>>
>>> So trying to have a helper that says "The VM can't zap and refault in memory
>>> at will" won't cut it. I guess there would have to be something more
>>> specific. I'm thinking of just dropping this patch instead.
>>
>> Or KVM_BUG_ON() in the callers by explicitly checking whether the VM type
>> is TDX, as I mentioned before.
>>
>> Having such a check in a generic function like this is just dangerous
>> and not flexible.
>>
>> Just my 2 cents, though.
> 
> As I said before, the point is to catch new callers. I see how it's a little
> wrong to assume the intentions of the callers, but I don't see how it's
> dangerous. Can you explain?

Dangerous means when "a little wrong to assume the intentions of the 
callers" actually goes wrong.  In other words, a general intention to 
"catch new callers" doesn't make a lot of sense to me.

Anyway as said before, it's just my 2 cents, and it's totally up to you.

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d5cf5b15a10e..808805b3478d 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6528,8 +6528,17 @@  void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end);
 
-	if (tdp_mmu_enabled)
+	if (tdp_mmu_enabled) {
+		/*
+		 * kvm_zap_gfn_range() is used when MTRR or PAT memory
+		 * type was changed.  TDX can't handle zapping the private
+		 * mapping, but it's ok because KVM doesn't support either of
+		 * those features for TDX. In case a new caller appears, BUG
+		 * the VM if it's called for solutions with private aliases.
+		 */
+		KVM_BUG_ON(kvm_gfn_shared_mask(kvm), kvm);
 		flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, gfn_end, flush);
+	}
 
 	if (flush)
 		kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - gfn_start);