[RFC,4/5] KVM: x86: aggressively map PTEs in KVM_MEM_ALLONES slots

Message ID 20200514180540.52407-5-vkuznets@redhat.com (mailing list archive)
State New, archived
Series KVM: x86: KVM_MEM_ALLONES memory

Commit Message

Vitaly Kuznetsov May 14, 2020, 6:05 p.m. UTC
All PTEs in KVM_MEM_ALLONES slots point to the same read-only page
in KVM, so instead of mapping each page upon first access, we can map
everything aggressively.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c         | 20 ++++++++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h | 23 +++++++++++++++++++++--
 2 files changed, 39 insertions(+), 4 deletions(-)

Comments

Sean Christopherson May 14, 2020, 7:46 p.m. UTC | #1
On Thu, May 14, 2020 at 08:05:39PM +0200, Vitaly Kuznetsov wrote:
> All PTEs in KVM_MEM_ALLONES slots point to the same read-only page
> in KVM, so instead of mapping each page upon first access, we can map
> everything aggressively.
> 
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c         | 20 ++++++++++++++++++--
>  arch/x86/kvm/mmu/paging_tmpl.h | 23 +++++++++++++++++++++--
>  2 files changed, 39 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3db499df2dfc..e92ca9ed3ff5 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4154,8 +4154,24 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  		goto out_unlock;
>  	if (make_mmu_pages_available(vcpu) < 0)
>  		goto out_unlock;
> -	r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
> -			 prefault, is_tdp && lpage_disallowed);
> +
> +	if (likely(!(slot->flags & KVM_MEM_ALLONES) || write)) {

The 'write' check is wrong.  More specifically, patch 2/5 is missing code
to add KVM_MEM_ALLONES to memslot_is_readonly().  If we end up going with
an actual kvm_allones_pg backing, writes to an ALLONES memslot should be
handled the same as writes to RO memslots; MMIO emulation occurs, but no
MMIO SPTE is created.
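
As a rough sketch (not part of this series), and assuming the KVM_MEM_ALLONES
flag from patch 2/5, the memslot_is_readonly() change could be as simple as:

static bool memslot_is_readonly(struct kvm_memory_slot *slot)
{
	/*
	 * Treat ALLONES slots like read-only slots so that guest writes take
	 * the same emulated-MMIO path as writes to KVM_MEM_READONLY slots,
	 * without installing an MMIO SPTE.
	 */
	return slot->flags & (KVM_MEM_READONLY | KVM_MEM_ALLONES);
}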

> +		r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
> +				 prefault, is_tdp && lpage_disallowed);
> +	} else {
> +		/*
> +		 * KVM_MEM_ALLONES are 4k only slots fully mapped to the same
> +		 * readonly 'allones' page, map all PTEs aggressively here.
> +		 */
> +		for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages;
> +		     gfn++) {
> +			r = __direct_map(vcpu, gfn << PAGE_SHIFT, write,
> +					 map_writable, max_level, pfn, prefault,
> +					 is_tdp && lpage_disallowed);

IMO this is a waste of memory and TLB entries.  Why not treat the access as
the MMIO it is and emulate the access with a 0xff return value?  I think
it'd be a simple change to have __kvm_read_guest_page() stuff 0xff, i.e. a
kvm_allones_pg wouldn't be needed.  I would even vote to never create an
MMIO SPTE.  The guest has bigger issues if reading from a PCI hole is
performance sensitive.
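
For illustration only, a rough sketch of that idea, assuming the
KVM_MEM_ALLONES flag from patch 2/5 (the unchanged part of the helper is
paraphrased from the existing code):

static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
				 void *data, int offset, int len)
{
	unsigned long addr;

	/*
	 * Sketch: an ALLONES slot has no backing page at all; satisfy the
	 * read directly with 0xff bytes instead of resolving an hva.
	 */
	if (slot && (slot->flags & KVM_MEM_ALLONES)) {
		memset(data, 0xff, len);
		return 0;
	}

	addr = gfn_to_hva_memslot_prot(slot, gfn, NULL);
	if (kvm_is_error_hva(addr))
		return -EFAULT;
	if (__copy_from_user(data, (void __user *)addr + offset, len))
		return -EFAULT;
	return 0;
}

Every read of an ALLONES gpa would then return all-ones without any memslot
backing, which matches the PCI-hole semantics the slot is meant to provide.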

Regarding memory, looping wantonly on __direct_map() will eventually trigger
the BUG_ON() in mmu_memory_cache_alloc().  mmu_topup_memory_caches() only
ensures there are enough objects available to map a single translation, i.e.
one entry per level, sans the root[*].

[*] The gorilla math in mmu_topup_memory_caches() is horrendously misleading,
    e.g. the '8' pages is really 2*(ROOT_LEVEL - 1), but the 2x part has been
    obsolete for the better part of a decade, and the '- 1' wasn't actually
    originally intended or needed, but is now required because of 5-level
    paging.  I have the beginning of a series to clean up that mess; it was
    low on my todo list because I didn't expect anyone to be mucking with
    related code :-)
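
Purely as an illustration of one way out (re-topping the caches inside the
loop isn't an option since mmu_lock is held, and the threshold below is a
guess rather than exact cache math): the aggressive loop could bail out once
the preallocated objects run low and let the remaining GFNs be mapped lazily
on first access, e.g.:

	for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages;
	     gfn++) {
		/*
		 * Each __direct_map() call may consume up to (root level - 1)
		 * pages and page headers from the per-vCPU MMU caches, which
		 * were only topped up for a single fault; stop prefetching
		 * before they run dry.
		 */
		if (vcpu->arch.mmu_page_cache.nobjs < PT64_ROOT_MAX_LEVEL - 1 ||
		    vcpu->arch.mmu_page_header_cache.nobjs < PT64_ROOT_MAX_LEVEL - 1)
			break;

		r = __direct_map(vcpu, gfn << PAGE_SHIFT, write, map_writable,
				 max_level, pfn, prefault,
				 is_tdp && lpage_disallowed);
		if (r)
			break;
	}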

> +			if (r)
> +				break;
> +		}
> +	}

Vitaly Kuznetsov May 15, 2020, 8:36 a.m. UTC | #2
Sean Christopherson <sean.j.christopherson@intel.com> writes:

> On Thu, May 14, 2020 at 08:05:39PM +0200, Vitaly Kuznetsov wrote:
>> All PTEs in KVM_MEM_ALLONES slots point to the same read-only page
>> in KVM, so instead of mapping each page upon first access, we can map
>> everything aggressively.
>> 
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>>  arch/x86/kvm/mmu/mmu.c         | 20 ++++++++++++++++++--
>>  arch/x86/kvm/mmu/paging_tmpl.h | 23 +++++++++++++++++++++--
>>  2 files changed, 39 insertions(+), 4 deletions(-)
>> 
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 3db499df2dfc..e92ca9ed3ff5 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -4154,8 +4154,24 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>>  		goto out_unlock;
>>  	if (make_mmu_pages_available(vcpu) < 0)
>>  		goto out_unlock;
>> -	r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
>> -			 prefault, is_tdp && lpage_disallowed);
>> +
>> +	if (likely(!(slot->flags & KVM_MEM_ALLONES) || write)) {
>
> The 'write' check is wrong.  More specifically, patch 2/5 is missing code
> to add KVM_MEM_ALLONES to memslot_is_readonly().  If we end up going with
> an actual kvm_allones_pg backing, writes to an ALLONES memslot should be
> handled the same as writes to RO memslots; MMIO emulation occurs, but no
> MMIO SPTE is created.
>

Missed that, thanks!

>> +		r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
>> +				 prefault, is_tdp && lpage_disallowed);
>> +	} else {
>> +		/*
>> +		 * KVM_MEM_ALLONES are 4k only slots fully mapped to the same
>> +		 * readonly 'allones' page, map all PTEs aggressively here.
>> +		 */
>> +		for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages;
>> +		     gfn++) {
>> +			r = __direct_map(vcpu, gfn << PAGE_SHIFT, write,
>> +					 map_writable, max_level, pfn, prefault,
>> +					 is_tdp && lpage_disallowed);
>
> IMO this is a waste of memory and TLB entries.  Why not treat the access as
> the MMIO it is and emulate the access with a 0xff return value?  I think
> it'd be a simple change to have __kvm_read_guest_page() stuff 0xff, i.e. a
> kvm_allones_pg wouldn't be needed.  I would even vote to never create an
> MMIO SPTE.  The guest has bigger issues if reading from a PCI hole is
> performance sensitive.

You're trying to defeat the sole purpose of the feature :-) I also
considered the option you suggest, but Michael convinced me we should go
further.

The idea (besides avoiding the memory waste) was that the time we spend
on the PCI scan during boot is significant. Unfortunately, I don't have
any numbers yet, but we can certainly try to get them. With this feature
(AFAIU) we're not aiming at 'classic' long-lived VMs but rather at
something like Kata containers/FaaS/... where boot time is crucial.

>
> Regarding memory, looping wantonly on __direct_map() will eventually trigger
> the BUG_ON() in mmu_memory_cache_alloc().  mmu_topup_memory_caches() only
> ensures there are enough objects available to map a single translation, i.e.
> one entry per level, sans the root[*].
>
> [*] The gorilla math in mmu_topup_memory_caches() is horrendously misleading,
>     e.g. the '8' pages is really 2*(ROOT_LEVEL - 1), but the 2x part has been
>     obsolete for the better part of a decade, and the '- 1' wasn't actually
>     originally intended or needed, but is now required because of 5-level
>     paging.  I have the beginning of a series to clean up that mess; it was
>     low on my todo list because I didn't expect anyone to be mucking with
>     related code :-)

I missed that too, but oh well, this is the famous KVM MMU, so I shouldn't
feel that bad about it :-) Thanks for your review!

>
>> +			if (r)
>> +				break;
>> +		}
>> +	}
>

Sean Christopherson May 15, 2020, 1:58 p.m. UTC | #3
On Fri, May 15, 2020 at 10:36:19AM +0200, Vitaly Kuznetsov wrote:
> Sean Christopherson <sean.j.christopherson@intel.com> writes:
> > IMO this is a waste of memory and TLB entries.  Why not treat the access as
> > the MMIO it is and emulate the access with a 0xff return value?  I think
> > it'd be a simple change to have __kvm_read_guest_page() stuff 0xff, i.e. a
> > kvm_allones_pg wouldn't be needed.  I would even vote to never create an
> > MMIO SPTE.  The guest has bigger issues if reading from a PCI hole is
> > performance sensitive.
> 
> You're trying to defeat the sole purpose of the feature :-) I also
> considered the option you suggest, but Michael convinced me we should go
> further.
> 
> The idea (besides avoiding the memory waste) was that the time we spend
> on the PCI scan during boot is significant.

Put that in the cover letter.  The impression I got from the current cover
letter is that the focus was entirely on memory consumption.

> Unfortunately, I don't have any numbers yet, but we can certainly try to
> get them.

Numbers are definitely required; otherwise we'll have no idea whether doing
something like the aggressive prefetch actually has a meaningful impact.

> With this feature (AFAIU) we're not aiming at 'classic' long-lived VMs but
> rather at something like Kata containers/FaaS/... where boot time is crucial.

Isn't the guest kernel fully controlled by the VMM in those use cases?
Why not enlighten the guest kernel in some way so that it doesn't have to
spend time scanning PCI space in the first place?

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3db499df2dfc..e92ca9ed3ff5 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4154,8 +4154,24 @@  static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 		goto out_unlock;
 	if (make_mmu_pages_available(vcpu) < 0)
 		goto out_unlock;
-	r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
-			 prefault, is_tdp && lpage_disallowed);
+
+	if (likely(!(slot->flags & KVM_MEM_ALLONES) || write)) {
+		r = __direct_map(vcpu, gpa, write, map_writable, max_level, pfn,
+				 prefault, is_tdp && lpage_disallowed);
+	} else {
+		/*
+		 * KVM_MEM_ALLONES are 4k only slots fully mapped to the same
+		 * readonly 'allones' page, map all PTEs aggressively here.
+		 */
+		for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages;
+		     gfn++) {
+			r = __direct_map(vcpu, gfn << PAGE_SHIFT, write,
+					 map_writable, max_level, pfn, prefault,
+					 is_tdp && lpage_disallowed);
+			if (r)
+				break;
+		}
+	}
 
 out_unlock:
 	spin_unlock(&vcpu->kvm->mmu_lock);
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 98e368788e8b..7bf0c48b858f 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -789,6 +789,7 @@  static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	bool lpage_disallowed = (error_code & PFERR_FETCH_MASK) &&
 				is_nx_huge_page_enabled();
 	int max_level;
+	gfn_t gfn;
 
 	pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
 
@@ -873,8 +874,26 @@  static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
 	kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
 	if (make_mmu_pages_available(vcpu) < 0)
 		goto out_unlock;
-	r = FNAME(fetch)(vcpu, addr, &walker, write_fault, max_level, pfn,
-			 map_writable, prefault, lpage_disallowed);
+	if (likely(!(slot->flags & KVM_MEM_ALLONES) || write_fault)) {
+		r = FNAME(fetch)(vcpu, addr, &walker, write_fault, max_level,
+				 pfn, map_writable, prefault, lpage_disallowed);
+	} else {
+		/*
+		 * KVM_MEM_ALLONES are 4k only slots fully mapped to the same
+		 * readonly 'allones' page, map all PTEs aggressively here.
+		 */
+		for (gfn = slot->base_gfn; gfn < slot->base_gfn + slot->npages;
+		     gfn++) {
+			walker.gfn = gfn;
+			r = FNAME(fetch)(vcpu, gfn << PAGE_SHIFT, &walker,
+					 write_fault, max_level, pfn,
+					 map_writable, prefault,
+					 lpage_disallowed);
+			if (r)
+				break;
+		}
+	}
+
 	kvm_mmu_audit(vcpu, AUDIT_POST_PAGE_FAULT);
 
 out_unlock: