
[v4,08/16] KVM: x86: Annotate -EFAULTs from kvm_handle_error_pfn()

Message ID 20230602161921.208564-9-amoorthy@google.com (mailing list archive)
State New, archived
Series Improve scalability of KVM + userfaultfd live migration via annotated memory faults.

Commit Message

Anish Moorthy June 2, 2023, 4:19 p.m. UTC
Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
kvm_handle_error_pfn().

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

Comments

Sean Christopherson June 14, 2023, 8:03 p.m. UTC | #1
On Fri, Jun 02, 2023, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
> kvm_handle_error_pfn().
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8961f45e3b1..cb71aae9aaec 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3291,6 +3291,10 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
>  
>  static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
> +	uint64_t rounded_gfn;

gfn_t, and probably no need to specify "rounded"; let the code do the talking.

> +	uint64_t fault_size;
> +	uint64_t fault_flags;

With a few exceptions that we should kill off, KVM uses "u64", not "uint64_t".
Though arguably the "size" should be gfn_t too.

And these can all go on a single line, e.g.

	u64 fault_size, fault_flags;

Though since the kvm_run.memory_fault field and the param to the helper are "len",
a simple "len" here is better IMO.

And since this is not remotely performance sensitive, I wouldn't bother deferring
the initialization, e.g.

	gfn_t gfn = gfn_round_for_level(fault->gfn, fault->goal_level);
	gfn_t len = KVM_HPAGE_SIZE(fault->goal_level);
	u64 fault_flags;

All that said, consuming fault->goal_level is unnecessary, and not by coincidence.
The *only* time KVM should bail to userspace is if KVM failed to faultin a 4KiB
page.  Providing a hugepage is done opportunistically; it's never a hard requirement.
So throw away all of the above and see below.

> +
>  	if (is_sigpending_pfn(fault->pfn)) {
>  		kvm_handle_signal_exit(vcpu);
>  		return -EINTR;
> @@ -3309,6 +3313,15 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
>  		return RET_PF_RETRY;
>  	}
>  
> +	fault_size = KVM_HPAGE_SIZE(fault->goal_level);
> +	rounded_gfn = round_down(fault->gfn * PAGE_SIZE, fault_size);

We have a helper for this too, gfn_round_for_level().  Ooh, but this isn't storing
a gfn, it's storing a gpa.  Naughty, naughty.
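
For reference, gfn_to_gpa() is the trivial shift helper, roughly as defined in
include/linux/kvm_host.h:

	static inline gpa_t gfn_to_gpa(gfn_t gfn)
	{
		return (gpa_t)gfn << PAGE_SHIFT;
	}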
	
> +
> +	fault_flags = 0;
> +	if (fault->write)
> +		fault_flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
> +	if (fault->exec)

exec and write are mutually exclusive.  There's even documented precedent for
this in page_fault_can_be_fast().
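
For reference, the documented precedent looks roughly like this (from
page_fault_can_be_fast() in arch/x86/kvm/mmu/mmu.c; exact wording may differ
across kernel versions):

	/*
	 * Note, instruction fetches and writes are mutually exclusive, ignore
	 * the "exec" flag.
	 */
	return fault->write;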

So with a READ flag, this can be

	if (fault->write)
		fault_flags = KVM_MEMORY_FAULT_FLAG_WRITE;
	else if (fault->exec)
		fault_flags = KVM_MEMORY_FAULT_FLAG_EXEC;
	else
		fault_flags = KVM_MEMORY_FAULT_FLAG_READ;

Or as Paolo would probably write it ;-)

	fault_flags = (fault->write & 1) << KVM_MEMORY_FAULT_FLAG_WRITE_SHIFT |
		      (fault->exec & 1) << KVM_MEMORY_FAULT_FLAG_EXEC_SHIFT |
		      (!fault->write && !fault->exec) << KVM_MEMORY_FAULT_FLAG_READ_SHIFT;

(that was a joke, don't actually do that)

> +		fault_flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
> +	kvm_populate_efault_info(vcpu, rounded_gfn, fault_size, fault_flags);

This is where passing a "gfn" variable as a "gpa" looks broken.

>  	return -EFAULT;

All in all, something like this?

	u64 fault_flags;

	<other error handling>

	/* comment goes here */
	WARN_ON_ONCE(fault->goal_level != PG_LEVEL_4K);

	if (fault->write)
		fault_flags = KVM_MEMORY_FAULT_FLAG_WRITE;
	else if (fault->exec)
		fault_flags = KVM_MEMORY_FAULT_FLAG_EXEC;
	else
		fault_flags = KVM_MEMORY_FAULT_FLAG_READ;

	kvm_handle_blahblahblah_fault(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
				      fault_flags);
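
Wiring that up to the helper this series currently adds would give something
like the following (a sketch only; the helper's final name and signature are
still under discussion):

	kvm_populate_efault_info(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
				 fault_flags);
	return -EFAULT;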
Robert Hoo June 15, 2023, 2:43 a.m. UTC | #2
On 6/3/2023 12:19 AM, Anish Moorthy wrote:
> Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
> kvm_handle_error_pfn().
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>   arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8961f45e3b1..cb71aae9aaec 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3291,6 +3291,10 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
>   
>   static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   {
> +	uint64_t rounded_gfn;
> +	uint64_t fault_size;
> +	uint64_t fault_flags;
> +
>   	if (is_sigpending_pfn(fault->pfn)) {
>   		kvm_handle_signal_exit(vcpu);
>   		return -EINTR;
> @@ -3309,6 +3313,15 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
>   		return RET_PF_RETRY;
>   	}
>   
> +	fault_size = KVM_HPAGE_SIZE(fault->goal_level);

IIUC, fault->goal_level is always PG_LEVEL_4K here. goal_level is only
adjusted later, in kvm_tdp_mmu_map() --> kvm_mmu_hugepage_adjust(), and only
when kvm_faultin_pfn() succeeds, i.e. the adjustment never happens on this
path.

I wonder whether you'd want to do (something like) kvm_mmu_hugepage_adjust()
here as well and report to userspace the maximum map size it could use, OR
just report the 4K size and let userspace detect/decide the maximum possible
size itself (though I have no idea how it would do that).

> +	rounded_gfn = round_down(fault->gfn * PAGE_SIZE, fault_size);
> +
> +	fault_flags = 0;
> +	if (fault->write)
> +		fault_flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
> +	if (fault->exec)
> +		fault_flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
> +	kvm_populate_efault_info(vcpu, rounded_gfn, fault_size, fault_flags);
>   	return -EFAULT;
>   }
>
Sean Christopherson June 15, 2023, 2:40 p.m. UTC | #3
On Thu, Jun 15, 2023, Robert Hoo wrote:
> On 6/3/2023 12:19 AM, Anish Moorthy wrote:
> > Implement KVM_CAP_MEMORY_FAULT_INFO for efaults generated by
> > kvm_handle_error_pfn().
> > 
> > Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > ---
> >   arch/x86/kvm/mmu/mmu.c | 13 +++++++++++++
> >   1 file changed, 13 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index c8961f45e3b1..cb71aae9aaec 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3291,6 +3291,10 @@ static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
> >   static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >   {
> > +	uint64_t rounded_gfn;
> > +	uint64_t fault_size;
> > +	uint64_t fault_flags;
> > +
> >   	if (is_sigpending_pfn(fault->pfn)) {
> >   		kvm_handle_signal_exit(vcpu);
> >   		return -EINTR;
> > @@ -3309,6 +3313,15 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
> >   		return RET_PF_RETRY;
> >   	}
> > +	fault_size = KVM_HPAGE_SIZE(fault->goal_level);
> 
> IIUC, fault->goal_level is always PG_LEVEL_4K here. goal_level is only
> adjusted later, in kvm_tdp_mmu_map() --> kvm_mmu_hugepage_adjust(), and only
> when kvm_faultin_pfn() succeeds, i.e. the adjustment never happens on this
> path.
> 
> I wonder whether you'd want to do (something like) kvm_mmu_hugepage_adjust()
> here as well and report to userspace the maximum map size it could use, OR
> just report the 4K size and let userspace detect/decide the maximum possible
> size itself (though I have no idea how it would do that).

No, that's nonsensical because KVM uses the host mapping to compute the max
mapping level.  If there's no valid mapping, then there's no defined level.  And
as I said in my reply, KVM should never kick out to userspace if KVM can establish
a 4KiB mapping, i.e. 4KiB is always the effective scope, and reporting anything
else would just be wild speculation.
Anish Moorthy July 7, 2023, 6:05 p.m. UTC | #4
On Wed, Jun 14, 2023 at 1:03 PM Sean Christopherson <seanjc@google.com> wrote:
>
> We have a helper for this too, gfn_round_for_level().  Ooh, but this isn't storing
> a gfn, it's storing a gpa.  Naughty, naughty.

Caught in mischief I didn't even realize I was committing :O Anyways,
I've taken all the feedback you provided here. Although,

> All that said, consuming fault->goal_level is unnecessary, and not by coincidence.
> The *only* time KVM should bail to userspace is if KVM failed to faultin a 4KiB
> page.  Providing a hugepage is done opportunistically; it's never a hard requirement.
> So throw away all of the above and see below.

I wonder if my arm64 patch commits the same error. I'll send an email
over there asking Oliver about it.

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8961f45e3b1..cb71aae9aaec 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3291,6 +3291,10 @@  static void kvm_send_hwpoison_signal(struct kvm_memory_slot *slot, gfn_t gfn)
 
 static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
+	uint64_t rounded_gfn;
+	uint64_t fault_size;
+	uint64_t fault_flags;
+
 	if (is_sigpending_pfn(fault->pfn)) {
 		kvm_handle_signal_exit(vcpu);
 		return -EINTR;
@@ -3309,6 +3313,15 @@  static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
 		return RET_PF_RETRY;
 	}
 
+	fault_size = KVM_HPAGE_SIZE(fault->goal_level);
+	rounded_gfn = round_down(fault->gfn * PAGE_SIZE, fault_size);
+
+	fault_flags = 0;
+	if (fault->write)
+		fault_flags |= KVM_MEMORY_FAULT_FLAG_WRITE;
+	if (fault->exec)
+		fault_flags |= KVM_MEMORY_FAULT_FLAG_EXEC;
+	kvm_populate_efault_info(vcpu, rounded_gfn, fault_size, fault_flags);
 	return -EFAULT;
 }
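
For context, a userspace VMM consuming this annotation might handle an EFAULT
from KVM_RUN along these lines (a sketch assuming the kvm_run.memory_fault
layout proposed earlier in this series; handle_guest_fault() is a hypothetical
VMM routine):

	ret = ioctl(vcpu_fd, KVM_RUN, 0);
	if (ret < 0 && errno == EFAULT) {
		/*
		 * With KVM_CAP_MEMORY_FAULT_INFO enabled, KVM annotates the
		 * -EFAULT with the faulting guest-physical range and access
		 * type before returning to userspace.
		 */
		handle_guest_fault(run->memory_fault.gpa,
				   run->memory_fault.len,
				   run->memory_fault.flags);
	}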