diff mbox series

[09/16] KVM: x86/mmu: Move private vs. shared check above slot validity checks

Message ID 20240228024147.41573-10-seanjc@google.com (mailing list archive)
State New, archived
Headers show
Series KVM: x86/mmu: Page fault and MMIO cleanups | expand

Commit Message

Sean Christopherson Feb. 28, 2024, 2:41 a.m. UTC
Prioritize private vs. shared gfn attribute checks above slot validity
checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
userspace if there is no memslot, but emulate accesses to the APIC access
page even if the attributes mismatch.

Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory")
Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

Comments

Huang, Kai March 5, 2024, 11:06 p.m. UTC | #1
On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> Prioritize private vs. shared gfn attribute checks above slot validity
> checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
> userspace if there is no memslot, but emulate accesses to the APIC access
> page even if the attributes mismatch.

IMHO, it would be helpful to explicitly say that, in the later case 
(emulate APIC access page) we still want to report MEMORY_FAULT error 
first (so that userspace can have chance to fixup, IIUC) instead of 
emulating directly, which will unlikely work.

Reviewed-by: Kai Huang <kai.huang@intel.com>

> 
> Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory")
> Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
> Cc: Chao Peng <chao.p.peng@linux.intel.com>
> Cc: Fuad Tabba <tabba@google.com>
> Cc: Michael Roth <michael.roth@amd.com>
> Cc: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/mmu/mmu.c | 15 ++++++++++-----
>   1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9206cfa58feb..58c5ae8be66c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4365,11 +4365,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>   			return RET_PF_EMULATE;
>   	}
>   
> -	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> -		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> -		return -EFAULT;
> -	}
> -
>   	if (fault->is_private)
>   		return kvm_faultin_pfn_private(vcpu, fault);
>   
> @@ -4410,6 +4405,16 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>   	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
>   	smp_rmb();
>   
> +	/*
> +	 * Check for a private vs. shared mismatch *after* taking a snapshot of
> +	 * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
> +	 * invalidation notifier.
> +	 */
> +	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +		return -EFAULT;
> +	}
> +
>   	/*
>   	 * Check for a relevant mmu_notifier invalidation event before getting
>   	 * the pfn from the primary MMU, and before acquiring mmu_lock.
Sean Christopherson March 6, 2024, 12:38 a.m. UTC | #2
On Wed, Mar 06, 2024, Kai Huang wrote:
> 
> 
> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > Prioritize private vs. shared gfn attribute checks above slot validity
> > checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
> > userspace if there is no memslot, but emulate accesses to the APIC access
> > page even if the attributes mismatch.
> 
> IMHO, it would be helpful to explicitly say that, in the later case (emulate
> APIC access page) we still want to report MEMORY_FAULT error first (so that
> userspace can have chance to fixup, IIUC) instead of emulating directly,
> which will unlikely work.

Hmm, it's not so much that emulating directly won't work, it's that KVM would be
violating its ABI.  Emulating APIC accesses after userspace converted the APIC
gfn to private would still work (I think), but KVM's ABI is that emulated MMIO
is shared-only.

FWIW, I doubt there's a legitmate use case for converting the APIC gfn to private,
this is purely to ensure KVM has simple, consistent rules for how private vs.
shared access work.
Huang, Kai March 6, 2024, 1:22 a.m. UTC | #3
On 6/03/2024 1:38 pm, Sean Christopherson wrote:
> On Wed, Mar 06, 2024, Kai Huang wrote:
>>
>>
>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>> Prioritize private vs. shared gfn attribute checks above slot validity
>>> checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
>>> userspace if there is no memslot, but emulate accesses to the APIC access
>>> page even if the attributes mismatch.
>>
>> IMHO, it would be helpful to explicitly say that, in the later case (emulate
>> APIC access page) we still want to report MEMORY_FAULT error first (so that
>> userspace can have chance to fixup, IIUC) instead of emulating directly,
>> which will unlikely work.
> 
> Hmm, it's not so much that emulating directly won't work, it's that KVM would be
> violating its ABI.  Emulating APIC accesses after userspace converted the APIC
> gfn to private would still work (I think), but KVM's ABI is that emulated MMIO
> is shared-only.

But for (at least) TDX guest I recall we _CAN_ allow guest's MMIO to be 
mapped as private, right?  The guest is supposed to get a #VE anyway?

Perhaps I am missing something -- I apologize if this has already been 
discussed.

> 
> FWIW, I doubt there's a legitmate use case for converting the APIC gfn to private,
> this is purely to ensure KVM has simple, consistent rules for how private vs.
> shared access work.

Again I _think_ for TDX APIC gfn can be private?  IIUC virtualizing APIC 
is done by the TDX module, which injects #VE to guest when emulation is 
required.
Sean Christopherson March 6, 2024, 2:02 a.m. UTC | #4
On Wed, Mar 06, 2024, Kai Huang wrote:
> 
> 
> On 6/03/2024 1:38 pm, Sean Christopherson wrote:
> > On Wed, Mar 06, 2024, Kai Huang wrote:
> > > 
> > > 
> > > On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > > > Prioritize private vs. shared gfn attribute checks above slot validity
> > > > checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
> > > > userspace if there is no memslot, but emulate accesses to the APIC access
> > > > page even if the attributes mismatch.
> > > 
> > > IMHO, it would be helpful to explicitly say that, in the later case (emulate
> > > APIC access page) we still want to report MEMORY_FAULT error first (so that
> > > userspace can have chance to fixup, IIUC) instead of emulating directly,
> > > which will unlikely work.
> > 
> > Hmm, it's not so much that emulating directly won't work, it's that KVM would be
> > violating its ABI.  Emulating APIC accesses after userspace converted the APIC
> > gfn to private would still work (I think), but KVM's ABI is that emulated MMIO
> > is shared-only.
> 
> But for (at least) TDX guest I recall we _CAN_ allow guest's MMIO to be
> mapped as private, right?  The guest is supposed to get a #VE anyway?

Not really.  KVM can't _map_ emulated MMIO as private memory, because S-EPT
entries can only point at convertible memory.  KVM _could_ emulate in response
to a !PRESENT EPT violation, but KVM is not going to do that.

https://lore.kernel.org/all/ZcUO5sFEAIH68JIA@google.com

> Perhaps I am missing something -- I apologize if this has already been
> discussed.
> 
> > 
> > FWIW, I doubt there's a legitmate use case for converting the APIC gfn to private,
> > this is purely to ensure KVM has simple, consistent rules for how private vs.
> > shared access work.
> 
> Again I _think_ for TDX APIC gfn can be private?  IIUC virtualizing APIC is
> done by the TDX module, which injects #VE to guest when emulation is
> required.

It's a moot point for TDX, as x2APIC is mandatory.
Huang, Kai March 6, 2024, 10:06 p.m. UTC | #5
On 6/03/2024 3:02 pm, Sean Christopherson wrote:
> On Wed, Mar 06, 2024, Kai Huang wrote:
>>
>>
>> On 6/03/2024 1:38 pm, Sean Christopherson wrote:
>>> On Wed, Mar 06, 2024, Kai Huang wrote:
>>>>
>>>>
>>>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>>>> Prioritize private vs. shared gfn attribute checks above slot validity
>>>>> checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
>>>>> userspace if there is no memslot, but emulate accesses to the APIC access
>>>>> page even if the attributes mismatch.
>>>>
>>>> IMHO, it would be helpful to explicitly say that, in the later case (emulate
>>>> APIC access page) we still want to report MEMORY_FAULT error first (so that
>>>> userspace can have chance to fixup, IIUC) instead of emulating directly,
>>>> which will unlikely work.
>>>
>>> Hmm, it's not so much that emulating directly won't work, it's that KVM would be
>>> violating its ABI.  Emulating APIC accesses after userspace converted the APIC
>>> gfn to private would still work (I think), but KVM's ABI is that emulated MMIO
>>> is shared-only.
>>
>> But for (at least) TDX guest I recall we _CAN_ allow guest's MMIO to be
>> mapped as private, right?  The guest is supposed to get a #VE anyway?
> 
> Not really.  KVM can't _map_ emulated MMIO as private memory, because S-EPT
> entries can only point at convertible memory.  

Right.  I was talking about the MMIO mapping in the guest, which can be 
private I suppose.

KVM _could_ emulate in response
> to a !PRESENT EPT violation, but KVM is not going to do that.
> 
> https://lore.kernel.org/all/ZcUO5sFEAIH68JIA@google.com
> 

Right agreed KVM shouldn't "emulate" !PRESENT fault.

I am talking about this -- for TDX guest, if I recall correctly, for 
guest's MMIO gfn KVM still needs to get the EPT violation for the 
_first_ access, in which KVM can configure the EPT entry to make sure 
"suppress #VE" bit is cleared so the later accesses can trigger #VE 
directly.

I suppose this is still the way we want to implement?

I am afraid for this case, we will still see the MMIO GFN as private, 
while by default I believe the "guest memory attributes" for that MMIO 
GFN should be shared?  AFAICT, it seems the "guest memory attributes" 
code doesn't check whether a GFN is MMIO or truly RAM.

So I agree making KVM only "emulate" shared MMIO makes sense, and we 
need this patch to cover the APIC access page case.

Anyway:

Reviewed-by: Kai Huang <kai.huang@intel.com>

>> Perhaps I am missing something -- I apologize if this has already been
>> discussed.
>>
>>>
>>> FWIW, I doubt there's a legitmate use case for converting the APIC gfn to private,
>>> this is purely to ensure KVM has simple, consistent rules for how private vs.
>>> shared access work.
>>
>> Again I _think_ for TDX APIC gfn can be private?  IIUC virtualizing APIC is
>> done by the TDX module, which injects #VE to guest when emulation is
>> required.
> 
> It's a moot point for TDX, as x2APIC is mandatory.

Right.  My bad to mention.
Sean Christopherson March 6, 2024, 11:49 p.m. UTC | #6
On Thu, Mar 07, 2024, Kai Huang wrote:
> 
> 
> On 6/03/2024 3:02 pm, Sean Christopherson wrote:
> > On Wed, Mar 06, 2024, Kai Huang wrote:
> > > 
> > > 
> > > On 6/03/2024 1:38 pm, Sean Christopherson wrote:
> > > > On Wed, Mar 06, 2024, Kai Huang wrote:
> > > > > 
> > > > > 
> > > > > On 28/02/2024 3:41 pm, Sean Christopherson wrote:
> > > > > > Prioritize private vs. shared gfn attribute checks above slot validity
> > > > > > checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
> > > > > > userspace if there is no memslot, but emulate accesses to the APIC access
> > > > > > page even if the attributes mismatch.
> > > > > 
> > > > > IMHO, it would be helpful to explicitly say that, in the later case (emulate
> > > > > APIC access page) we still want to report MEMORY_FAULT error first (so that
> > > > > userspace can have chance to fixup, IIUC) instead of emulating directly,
> > > > > which will unlikely work.
> > > > 
> > > > Hmm, it's not so much that emulating directly won't work, it's that KVM would be
> > > > violating its ABI.  Emulating APIC accesses after userspace converted the APIC
> > > > gfn to private would still work (I think), but KVM's ABI is that emulated MMIO
> > > > is shared-only.
> > > 
> > > But for (at least) TDX guest I recall we _CAN_ allow guest's MMIO to be
> > > mapped as private, right?  The guest is supposed to get a #VE anyway?
> > 
> > Not really.  KVM can't _map_ emulated MMIO as private memory, because S-EPT
> > entries can only point at convertible memory.
> 
> Right.  I was talking about the MMIO mapping in the guest, which can be
> private I suppose.
> 
> > KVM _could_ emulate in response to a !PRESENT EPT violation, but KVM is not
> > going to do that.
> > 
> > https://lore.kernel.org/all/ZcUO5sFEAIH68JIA@google.com
> > 
> 
> Right agreed KVM shouldn't "emulate" !PRESENT fault.

One clarification.  KVM *does* emulate !PRESENT faults.  And that's not optional,
as caching MMIO info in SPTEs is purely an optimization and isn't possible on all
CPUs, e.g. AMD CPUs with MAXPHYADDR=52 don't have reserved bits to set.

My point above was specifically about emulating *private* !PRESENT faults as MMIO.

> I am talking about this -- for TDX guest, if I recall correctly, for guest's
> MMIO gfn KVM still needs to get the EPT violation for the _first_ access, in
> which KVM can configure the EPT entry to make sure "suppress #VE" bit is
> cleared so the later accesses can trigger #VE directly.

That's totally fine, so long as the access is shared, not private.

> I suppose this is still the way we want to implement?
> 
> I am afraid for this case, we will still see the MMIO GFN as private, while
> by default I believe the "guest memory attributes" for that MMIO GFN should
> be shared?

No, the guest should know it's (emulated) MMIO and access the gfn as shared.  That
might generate a !PRESENT fault, but that's again a-ok.

> AFAICT, it seems the "guest memory attributes" code doesn't check whether a
> GFN is MMIO or truly RAM.

That's userspace's responsibility, not KVM's responsibility.  And if userspace
doesn't proactively make emulated MMIO regions shared, then the memory_fault exit
that results from this patch will give userspace the hook it needs to convert the
gfn to shared on-demand.
Huang, Kai March 7, 2024, 12:28 a.m. UTC | #7
On 7/03/2024 12:49 pm, Sean Christopherson wrote:
> On Thu, Mar 07, 2024, Kai Huang wrote:
>>
>>
>> On 6/03/2024 3:02 pm, Sean Christopherson wrote:
>>> On Wed, Mar 06, 2024, Kai Huang wrote:
>>>>
>>>>
>>>> On 6/03/2024 1:38 pm, Sean Christopherson wrote:
>>>>> On Wed, Mar 06, 2024, Kai Huang wrote:
>>>>>>
>>>>>>
>>>>>> On 28/02/2024 3:41 pm, Sean Christopherson wrote:
>>>>>>> Prioritize private vs. shared gfn attribute checks above slot validity
>>>>>>> checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
>>>>>>> userspace if there is no memslot, but emulate accesses to the APIC access
>>>>>>> page even if the attributes mismatch.
>>>>>>
>>>>>> IMHO, it would be helpful to explicitly say that, in the later case (emulate
>>>>>> APIC access page) we still want to report MEMORY_FAULT error first (so that
>>>>>> userspace can have chance to fixup, IIUC) instead of emulating directly,
>>>>>> which will unlikely work.
>>>>>
>>>>> Hmm, it's not so much that emulating directly won't work, it's that KVM would be
>>>>> violating its ABI.  Emulating APIC accesses after userspace converted the APIC
>>>>> gfn to private would still work (I think), but KVM's ABI is that emulated MMIO
>>>>> is shared-only.
>>>>
>>>> But for (at least) TDX guest I recall we _CAN_ allow guest's MMIO to be
>>>> mapped as private, right?  The guest is supposed to get a #VE anyway?
>>>
>>> Not really.  KVM can't _map_ emulated MMIO as private memory, because S-EPT
>>> entries can only point at convertible memory.
>>
>> Right.  I was talking about the MMIO mapping in the guest, which can be
>> private I suppose.
>>
>>> KVM _could_ emulate in response to a !PRESENT EPT violation, but KVM is not
>>> going to do that.
>>>
>>> https://lore.kernel.org/all/ZcUO5sFEAIH68JIA@google.com
>>>
>>
>> Right agreed KVM shouldn't "emulate" !PRESENT fault.
> 
> One clarification.  KVM *does* emulate !PRESENT faults.  And that's not optional,
> as caching MMIO info in SPTEs is purely an optimization and isn't possible on all
> CPUs, e.g. AMD CPUs with MAXPHYADDR=52 don't have reserved bits to set.

Sorry I forgot to add "private".

> 
> My point above was specifically about emulating *private* !PRESENT faults as MMIO.
> 
>> I am talking about this -- for TDX guest, if I recall correctly, for guest's
>> MMIO gfn KVM still needs to get the EPT violation for the _first_ access, in
>> which KVM can configure the EPT entry to make sure "suppress #VE" bit is
>> cleared so the later accesses can trigger #VE directly.
> 
> That's totally fine, so long as the access is shared, not private.

OK as you already replied in the later patch.

> 
>> I suppose this is still the way we want to implement?
>>
>> I am afraid for this case, we will still see the MMIO GFN as private, while
>> by default I believe the "guest memory attributes" for that MMIO GFN should
>> be shared?
> 
> No, the guest should know it's (emulated) MMIO and access the gfn as shared.  That
> might generate a !PRESENT fault, but that's again a-ok.

Ditto.

> 
>> AFAICT, it seems the "guest memory attributes" code doesn't check whether a
>> GFN is MMIO or truly RAM.
> 
> That's userspace's responsibility, not KVM's responsibility.  And if userspace
> doesn't proactively make emulated MMIO regions shared, then the memory_fault exit
> that results from this patch will give userspace the hook it needs to convert the
> gfn to shared on-demand.

I mean whether it's better to just make kvm_mem_is_private() check 
whether the given GFN has slot, and always return "shared" if it doesn't.

In kvm_vm_set_mem_attributes() we also ignore NULL-slot GFNs.

(APIC access page is a special case that needs to handle.)

Allowing userspace to maintain whether MMIO GFN is shared or private 
doesn't make a lot sense because that doesn't help a lot due to the MMIO 
mapping is actually controlled by the guest kernel itself.

The (buggy) guest may still generate private !PRESNET faults, and KVM 
can still return to userspace with MEMORY_FAULT, but userspace can just 
kill the VM if the faulting address is MMIO.

But this will complicate the code so I guess may not worth to do..

Thanks for your time. :-)
Xu Yilun March 8, 2024, 4:54 a.m. UTC | #8
On Tue, Feb 27, 2024 at 06:41:40PM -0800, Sean Christopherson wrote:
> Prioritize private vs. shared gfn attribute checks above slot validity
> checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
> userspace if there is no memslot, but emulate accesses to the APIC access
> page even if the attributes mismatch.
> 
> Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory")
> Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
> Cc: Chao Peng <chao.p.peng@linux.intel.com>
> Cc: Fuad Tabba <tabba@google.com>
> Cc: Michael Roth <michael.roth@amd.com>
> Cc: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9206cfa58feb..58c5ae8be66c 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4365,11 +4365,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  			return RET_PF_EMULATE;
>  	}
>  
> -	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> -		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> -		return -EFAULT;
> -	}
> -
>  	if (fault->is_private)
>  		return kvm_faultin_pfn_private(vcpu, fault);
>  
> @@ -4410,6 +4405,16 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
>  	smp_rmb();
>  
> +	/*
> +	 * Check for a private vs. shared mismatch *after* taking a snapshot of
> +	 * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
> +	 * invalidation notifier.

I didn't see how mmu_invalidate_seq influences gfn attribute judgement.
And there is no synchronization between the below check and
kvm_vm_set_mem_attributes(), the gfn attribute could still be changing
after the snapshot.  So why this comment?

Thanks,
Yilun

> +	 */
> +	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> +		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> +		return -EFAULT;
> +	}
> +
>  	/*
>  	 * Check for a relevant mmu_notifier invalidation event before getting
>  	 * the pfn from the primary MMU, and before acquiring mmu_lock.
> -- 
> 2.44.0.278.ge034bb2e1d-goog
> 
>
Sean Christopherson March 8, 2024, 11:28 p.m. UTC | #9
On Fri, Mar 08, 2024, Xu Yilun wrote:
> On Tue, Feb 27, 2024 at 06:41:40PM -0800, Sean Christopherson wrote:
> > Prioritize private vs. shared gfn attribute checks above slot validity
> > checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
> > userspace if there is no memslot, but emulate accesses to the APIC access
> > page even if the attributes mismatch.
> > 
> > Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory")
> > Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
> > Cc: Chao Peng <chao.p.peng@linux.intel.com>
> > Cc: Fuad Tabba <tabba@google.com>
> > Cc: Michael Roth <michael.roth@amd.com>
> > Cc: Isaku Yamahata <isaku.yamahata@intel.com>
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c | 15 ++++++++++-----
> >  1 file changed, 10 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 9206cfa58feb..58c5ae8be66c 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4365,11 +4365,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  			return RET_PF_EMULATE;
> >  	}
> >  
> > -	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > -		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > -		return -EFAULT;
> > -	}
> > -
> >  	if (fault->is_private)
> >  		return kvm_faultin_pfn_private(vcpu, fault);
> >  
> > @@ -4410,6 +4405,16 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> >  	smp_rmb();
> >  
> > +	/*
> > +	 * Check for a private vs. shared mismatch *after* taking a snapshot of
> > +	 * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
> > +	 * invalidation notifier.
> 
> I didn't see how mmu_invalidate_seq influences gfn attribute judgement.
> And there is no synchronization between the below check and
> kvm_vm_set_mem_attributes(), the gfn attribute could still be changing
> after the snapshot.

There is synchronization.  If kvm_vm_set_mem_attributes() changes the attributes,
and thus bumps mmu_invalidate_seq, after kvm_faultin_pfn() takes its snapshot,
then is_page_fault_stale() will detect that an invalidation related to the gfn
occured and resume the guest *without* installing a mapping in KVM's page tables.

I.e. KVM may read the old, stale gfn attributes, but it will never actually
expose the stale attirubtes to the guest.
Xu Yilun March 11, 2024, 4:43 a.m. UTC | #10
On Fri, Mar 08, 2024 at 03:28:08PM -0800, Sean Christopherson wrote:
> On Fri, Mar 08, 2024, Xu Yilun wrote:
> > On Tue, Feb 27, 2024 at 06:41:40PM -0800, Sean Christopherson wrote:
> > > Prioritize private vs. shared gfn attribute checks above slot validity
> > > checks to ensure a consistent userspace ABI.  E.g. as is, KVM will exit to
> > > userspace if there is no memslot, but emulate accesses to the APIC access
> > > page even if the attributes mismatch.
> > > 
> > > Fixes: 8dd2eee9d526 ("KVM: x86/mmu: Handle page fault for private memory")
> > > Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
> > > Cc: Chao Peng <chao.p.peng@linux.intel.com>
> > > Cc: Fuad Tabba <tabba@google.com>
> > > Cc: Michael Roth <michael.roth@amd.com>
> > > Cc: Isaku Yamahata <isaku.yamahata@intel.com>
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  arch/x86/kvm/mmu/mmu.c | 15 ++++++++++-----
> > >  1 file changed, 10 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 9206cfa58feb..58c5ae8be66c 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -4365,11 +4365,6 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> > >  			return RET_PF_EMULATE;
> > >  	}
> > >  
> > > -	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> > > -		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> > > -		return -EFAULT;
> > > -	}
> > > -
> > >  	if (fault->is_private)
> > >  		return kvm_faultin_pfn_private(vcpu, fault);
> > >  
> > > @@ -4410,6 +4405,16 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > >  	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > >  	smp_rmb();
> > >  
> > > +	/*
> > > +	 * Check for a private vs. shared mismatch *after* taking a snapshot of
> > > +	 * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
> > > +	 * invalidation notifier.
> > 
> > I didn't see how mmu_invalidate_seq influences gfn attribute judgement.
> > And there is no synchronization between the below check and
> > kvm_vm_set_mem_attributes(), the gfn attribute could still be changing
> > after the snapshot.
> 
> There is synchronization.  If kvm_vm_set_mem_attributes() changes the attributes,
> and thus bumps mmu_invalidate_seq, after kvm_faultin_pfn() takes its snapshot,
> then is_page_fault_stale() will detect that an invalidation related to the gfn
> occured and resume the guest *without* installing a mapping in KVM's page tables.
> 
> I.e. KVM may read the old, stale gfn attributes, but it will never actually
> expose the stale attirubtes to the guest.

That makes sense!  I was just thinking of the racing for below few lines,

	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
		return -EFAULT;
	}

But the guarding is actually for the whole kvm_faultin_pfn(). It is the
the same mechanism between getting old gfn attributes and getting old pfn.

I wonder if we could instead add some general comments at

   fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;

about the snapshot and is_page_fault_stale() thing.

Thanks,
Yilun
>
Sean Christopherson March 12, 2024, 12:08 a.m. UTC | #11
On Mon, Mar 11, 2024, Xu Yilun wrote:
> On Fri, Mar 08, 2024 at 03:28:08PM -0800, Sean Christopherson wrote:
> > On Fri, Mar 08, 2024, Xu Yilun wrote:
> > > On Tue, Feb 27, 2024 at 06:41:40PM -0800, Sean Christopherson wrote:
> > > > @@ -4410,6 +4405,16 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > >  	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> > > >  	smp_rmb();
> > > >  
> > > > +	/*
> > > > +	 * Check for a private vs. shared mismatch *after* taking a snapshot of
> > > > +	 * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
> > > > +	 * invalidation notifier.
> > > 
> > > I didn't see how mmu_invalidate_seq influences gfn attribute judgement.
> > > And there is no synchronization between the below check and
> > > kvm_vm_set_mem_attributes(), the gfn attribute could still be changing
> > > after the snapshot.
> > 
> > There is synchronization.  If kvm_vm_set_mem_attributes() changes the attributes,
> > and thus bumps mmu_invalidate_seq, after kvm_faultin_pfn() takes its snapshot,
> > then is_page_fault_stale() will detect that an invalidation related to the gfn
> > occured and resume the guest *without* installing a mapping in KVM's page tables.
> > 
> > I.e. KVM may read the old, stale gfn attributes, but it will never actually
> > expose the stale attirubtes to the guest.
> 
> That makes sense!  I was just thinking of the racing for below few lines,
> 
> 	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
> 		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
> 		return -EFAULT;
> 	}
> 
> But the guarding is actually for the whole kvm_faultin_pfn(). It is the
> the same mechanism between getting old gfn attributes and getting old pfn.
> 
> I wonder if we could instead add some general comments at
> 
>    fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
> 
> about the snapshot and is_page_fault_stale() thing.

Yeah, I'll add a comment.  The only reason not to add a comment is that, ideally,
the comment/documentation would live in common KVM code, not x86.  But this code
already has a few big comments about the mmu_notifier retry logic, one more
definitely won't hurt.
diff mbox series

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9206cfa58feb..58c5ae8be66c 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4365,11 +4365,6 @@  static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 			return RET_PF_EMULATE;
 	}
 
-	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
-		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
-		return -EFAULT;
-	}
-
 	if (fault->is_private)
 		return kvm_faultin_pfn_private(vcpu, fault);
 
@@ -4410,6 +4405,16 @@  static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	fault->mmu_seq = vcpu->kvm->mmu_invalidate_seq;
 	smp_rmb();
 
+	/*
+	 * Check for a private vs. shared mismatch *after* taking a snapshot of
+	 * mmu_invalidate_seq, as changes to gfn attributes are guarded by the
+	 * invalidation notifier.
+	 */
+	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn)) {
+		kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
+		return -EFAULT;
+	}
+
 	/*
 	 * Check for a relevant mmu_notifier invalidation event before getting
 	 * the pfn from the primary MMU, and before acquiring mmu_lock.