[kvm-unit-tests,v2,4/4] x86: nSVM: Build up the nested page table dynamically

Message ID 20220324053046.200556-5-manali.shukla@amd.com (mailing list archive)
State New, archived
Series Move npt test cases and NPT code improvements

Commit Message

Manali Shukla March 24, 2022, 5:30 a.m. UTC
The current implementation of the nested page table builds the page
table statically, with 2048 PTEs and one PML4 entry, which makes it
inextensible.

The new implementation builds the page table dynamically, based on the
RAM size of the VM, which makes it possible to carve out a separate
memory range for the various NPT test cases.

Signed-off-by: Manali Shukla <manali.shukla@amd.com>
---
 x86/svm.c     | 163 ++++++++++++++++++++++++++++++++++----------------
 x86/svm.h     |  17 +++++-
 x86/svm_npt.c |   4 +-
 3 files changed, 130 insertions(+), 54 deletions(-)

Comments

Sean Christopherson April 13, 2022, 9:33 p.m. UTC | #1
On Thu, Mar 24, 2022, Manali Shukla wrote:
> The current implementation of the nested page table builds the page
> table statically, with 2048 PTEs and one PML4 entry, which makes it
> inextensible.
> 
> The new implementation builds the page table dynamically, based on the
> RAM size of the VM, which makes it possible to carve out a separate
> memory range for the various NPT test cases.
> 
> Signed-off-by: Manali Shukla <manali.shukla@amd.com>
> ---
>  x86/svm.c     | 163 ++++++++++++++++++++++++++++++++++----------------

Ok, so I got fairly far into reviewing this (see below, but it can be ignored)
before realizing that all this new code is nearly identical to what's in lib/x86/vm.c.
E.g. find_pte_level() and install_pte() can probably be used almost verbatim.

Instead of duplicating code, can you extend vm.c as necessary?  It might not
even require any changes.  I'll happily clean up vm.c in the future, e.g. to fix
the misleading nomenclature and open coded horrors, but for your purposes I think
you should be able to get away with a bare minimum of changes.
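
For reference, a rough (and untested) sketch of what that could look like,
assuming vm.c's current find_pte_level() and install_pte() signatures, and
relying on the fact that these tests identity-map the guest:

u64 *npt_get_pte(u64 gpa)
{
    /* The NPT is an identity map, so a gpa can be passed as the "virt". */
    return (u64 *)find_pte_level(npt_get_pml4e(), (void *)gpa, 1);
}

static void setup_npt_range(u64 *pml4, u64 start, u64 len, u64 perm)
{
    u64 gpa;

    /* Identity-map [start, start + len) with 4k pages. */
    for (gpa = start; gpa + PAGE_SIZE <= start + len; gpa += PAGE_SIZE)
        install_pte((pgd_t *)pml4, 1, (void *)gpa, (gpa & PAGE_MASK) | perm, NULL);
}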

>  x86/svm.h     |  17 +++++-
>  x86/svm_npt.c |   4 +-
>  3 files changed, 130 insertions(+), 54 deletions(-)
> 
> diff --git a/x86/svm.c b/x86/svm.c
> index d0d523a..67dbe31 100644
> --- a/x86/svm.c
> +++ b/x86/svm.c
> @@ -8,6 +8,7 @@
>  #include "desc.h"
>  #include "msr.h"
>  #include "vm.h"
> +#include "fwcfg.h"
>  #include "smp.h"
>  #include "types.h"
>  #include "alloc_page.h"
> @@ -16,38 +17,67 @@
>  #include "vmalloc.h"
>  
>  /* for the nested page table*/
> -u64 *pte[2048];
> -u64 *pde[4];
> -u64 *pdpe;
>  u64 *pml4e;
>  
>  struct vmcb *vmcb;
>  
> -u64 *npt_get_pte(u64 address)
> +u64* get_npt_pte(u64 *pml4,

Heh, the usual way to handle wrappers is to add underscores, i.e.

u64 *npt_get_pte(u64 address)
{
    return __npt_get_pte(npt_get_pml4e(), address, 1);
}

swapping the order just results in namespacing weirdness and doesn't convey to the
reader that this is an "inner" helper.

> u64 guest_addr, int level)

Assuming guest_addr is a gpa, call it gpa to avoid ambiguity over virtual vs.
physical.

>  {
> -	int i1, i2;
> +    int l;
> +    u64 *pt = pml4, iter_pte;

Please put pointers and non-pointers on separate lines.  And just "pte" for
the tmp, it's not actually used as an iterator.  And with that, I have a slight
preference for page_table over pt so that it's not mistaken for pte.
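
I.e.:

    u64 *page_table = pml4;
    u64 pte;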

> +    unsigned offset;

No bare unsigned please.  And "offset" is the wrong terminology, "index" or "idx"
is preferable.  An offset is usually an offset in bytes, this indexes into a u64
array.

Ugh, looks like that awful name comes from PGDIR_OFFSET in lib/x86/asm/page.h.
The offset, at least in Intel SDM terminology, is specifically the last N:0 bits
of the virtual address (or guest physical) that are the offset into the physical
page, e.g. 11:0 for a 4kb page, 20:0 for a 2mb page.

> +
> +    assert(level >= 1 && level <= 4);

The upper bound should be NPT_PAGE_LEVEL, or root_level (see below).

> +    for(l = NPT_PAGE_LEVEL; ; --l) {

Nit, need a space after "for".

Also, can you plumb in the root level?  E.g. have npt_get_pte() hardcode the
root in this case.  At some point this will hopefully support 5-level NPT, at
which point hardcoding the root will require updating more code than should be
necessary.
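
E.g. a (completely untested) sketch, using the PT64_INDEX() macro suggested
below and folding in the other comments:

u64 *__npt_get_pte(u64 *pml4, u64 gpa, int level, int root_level)
{
    u64 *page_table = pml4;
    u64 pte;
    int index;
    int l;

    assert(level >= 1 && level <= root_level);

    /* Walk down from the root, stopping at the requested level. */
    for (l = root_level; ; --l) {
        index = PT64_INDEX(gpa, l);
        pte = page_table[index];
        if (l == level)
            return &page_table[index];
        assert(pte & NPT_PRESENT);
        page_table = (u64 *)(pte & PT64_BASE_ADDR_MASK);
    }
}

u64 *npt_get_pte(u64 gpa)
{
    return __npt_get_pte(npt_get_pml4e(), gpa, 1, NPT_PAGE_LEVEL);
}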

> +        offset = (guest_addr >> (((l - 1) * NPT_PGDIR_WIDTH) + 12))
> +                 & NPT_PGDIR_MASK;

Not your code (I think), but NPT_PGDIR_MASK is an odd name since it's common to
all levels.  The easiest thing would be to loosely follow KVM.  Actually, I think it
makes sense to grab the PT64_ stuff from KVM:

#define PT64_BASE_ADDR_MASK (((1ULL << 52) - 1) & ~(u64)(PAGE_SIZE-1))
#define PT64_LEVEL_BITS 9
#define PT64_LEVEL_SHIFT(level) \
		(PAGE_SHIFT + (level - 1) * PT64_LEVEL_BITS)
#define PT64_INDEX(address, level)\
	(((address) >> PT64_LEVEL_SHIFT(level)) & ((1 << PT64_LEVEL_BITS) - 1))


and then use those instead of having dedicated NPT_* defines.  That makes it more
obvious that (a) SVM/NPT tests are 64-bit only and (b) there's nothing special
about NPT with respect to "legacy" 64-bit paging.

That will provide a nice macro, PT64_INDEX, to replace the open coded calculations.
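I.e. each open coded shift+mask pair above becomes simply:

    index = PT64_INDEX(gpa, l);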

> +        if (l == level)
> +            break;
> +        if (!(iter_pte & NPT_PRESENT))
> +            return false;

Returning "false" works, but it's all kinds of wrong.  This should either assert or
return NULL.
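
E.g. (with the tmp renamed to "pte" per above):

    assert(pte & NPT_PRESENT);

or:

    if (!(pte & NPT_PRESENT))
        return NULL;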

> +        pt = (u64*)(iter_pte & PT_ADDR_MASK);
> +    }
> +    offset = (guest_addr >> (((l - 1) * NPT_PGDIR_WIDTH) + 12))
> +             & NPT_PGDIR_MASK;

Hmm, this is unnecessary because the for-loop can't terminate on its own; it
can only exit on "l == level", and offset is already correct in that case.
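
I.e. (sketch, with the renames above) the terminal lookup can simply happen at
the break-out point:

    if (l == level)
        return &page_table[index];

and the duplicated computation after the loop goes away entirely.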
Shukla, Manali April 14, 2022, 4:39 p.m. UTC | #2
On 4/14/2022 3:03 AM, Sean Christopherson wrote:
> On Thu, Mar 24, 2022, Manali Shukla wrote:
> [...]

Hey Sean,

Thank you so much for reviewing the code.

I will work on the comments.

- Manali

Patch

diff --git a/x86/svm.c b/x86/svm.c
index d0d523a..67dbe31 100644
--- a/x86/svm.c
+++ b/x86/svm.c
@@ -8,6 +8,7 @@ 
 #include "desc.h"
 #include "msr.h"
 #include "vm.h"
+#include "fwcfg.h"
 #include "smp.h"
 #include "types.h"
 #include "alloc_page.h"
@@ -16,38 +17,67 @@ 
 #include "vmalloc.h"
 
 /* for the nested page table*/
-u64 *pte[2048];
-u64 *pde[4];
-u64 *pdpe;
 u64 *pml4e;
 
 struct vmcb *vmcb;
 
-u64 *npt_get_pte(u64 address)
+u64* get_npt_pte(u64 *pml4, u64 guest_addr, int level)
 {
-	int i1, i2;
+    int l;
+    u64 *pt = pml4, iter_pte;
+    unsigned offset;
+
+    assert(level >= 1 && level <= 4);
+
+    for(l = NPT_PAGE_LEVEL; ; --l) {
+        offset = (guest_addr >> (((l - 1) * NPT_PGDIR_WIDTH) + 12))
+                 & NPT_PGDIR_MASK;
+        iter_pte = pt[offset];
+        if (l == level)
+            break;
+        if (!(iter_pte & NPT_PRESENT))
+            return false;
+        pt = (u64*)(iter_pte & PT_ADDR_MASK);
+    }
+    offset = (guest_addr >> (((l - 1) * NPT_PGDIR_WIDTH) + 12))
+             & NPT_PGDIR_MASK;
+    return  &pt[offset];
+}
 
-	address >>= 12;
-	i1 = (address >> 9) & 0x7ff;
-	i2 = address & 0x1ff;
+void set_npt_pte(u64 *pml4, u64 guest_addr,
+        int level, u64  pte_val)
+{
+    int l;
+    unsigned long *pt = pml4;
+    unsigned offset;
+
+    for (l = NPT_PAGE_LEVEL; ; --l) {
+        offset = (guest_addr >> (((l - 1) * NPT_PGDIR_WIDTH) + 12))
+                 & NPT_PGDIR_MASK;
+        if (l == level)
+            break;
+        if (!(pt[offset] & NPT_PRESENT))
+            return;
+        pt = (u64*)(pt[offset] & PT_ADDR_MASK);
+    }
+    offset = (guest_addr >> (((l - 1) * NPT_PGDIR_WIDTH) + 12))
+              & NPT_PGDIR_MASK;
+    pt[offset] = pte_val;
+}
 
-	return &pte[i1][i2];
+u64 *npt_get_pte(u64 address)
+{
+    return get_npt_pte(npt_get_pml4e(), address, 1);
 }
 
 u64 *npt_get_pde(u64 address)
 {
-	int i1, i2;
-
-	address >>= 21;
-	i1 = (address >> 9) & 0x3;
-	i2 = address & 0x1ff;
-
-	return &pde[i1][i2];
+    return get_npt_pte(npt_get_pml4e(), address, 2);
 }
 
-u64 *npt_get_pdpe(void)
+u64 *npt_get_pdpe(u64 address)
 {
-	return pdpe;
+	return get_npt_pte(npt_get_pml4e(), address, 3);
 }
 
 u64 *npt_get_pml4e(void)
@@ -309,11 +339,72 @@  static void set_additional_vcpu_msr(void *msr_efer)
 	wrmsr(MSR_EFER, (ulong)msr_efer | EFER_SVME);
 }
 
+static void install_npt_entry(u64 *pml4,
+        int pte_level,
+        u64 guest_addr,
+        u64 pte,
+        u64 *pt_page)
+{
+    int level;
+    unsigned long *pt = pml4;
+    unsigned offset;
+
+    for (level = NPT_PAGE_LEVEL; level > pte_level; --level) {
+        offset = (guest_addr >> (((level - 1) * NPT_PGDIR_WIDTH) + 12))
+                  & NPT_PGDIR_MASK;
+        if (!(pt[offset] & PT_PRESENT_MASK)) {
+            unsigned long *new_pt = pt_page;
+            if (!new_pt) {
+                new_pt = alloc_page();
+            } else
+                pt_page = 0;
+            memset(new_pt, 0, PAGE_SIZE);
+
+            pt[offset] = virt_to_phys(new_pt) | NPT_USER_ACCESS |
+                         NPT_ACCESS_BIT | NPT_PRESENT | NPT_RW_ACCESS;
+        }
+        pt = phys_to_virt(pt[offset] & PT_ADDR_MASK);
+    }
+    offset = (guest_addr >> (((level - 1) * NPT_PGDIR_WIDTH) + 12))
+              & NPT_PGDIR_MASK;
+    pt[offset] = pte | NPT_USER_ACCESS | NPT_ACCESS_BIT | NPT_DIRTY_BIT;
+}
+
+void install_npt(u64 *pml4, u64 phys, u64 guest_addr,
+        u64 perm)
+{
+    install_npt_entry(pml4, 1, guest_addr,
+            (phys & PAGE_MASK) | perm, 0);
+}
+
+static void setup_npt_range(u64 *pml4, u64 start,
+        u64 len, u64 perm)
+{
+    u64 phys = start;
+    u64 max = (u64)len + (u64)start;
+
+    while (phys + PAGE_SIZE <= max) {
+        install_npt(pml4, phys, phys, perm);
+        phys += PAGE_SIZE;
+    }
+}
+
+void setup_npt(void) {
+    u64 end_of_memory;
+    pml4e = alloc_page();
+
+    end_of_memory = fwcfg_get_u64(FW_CFG_RAM_SIZE);
+    if (end_of_memory < (1ul << 32))
+        end_of_memory = (1ul << 32);
+
+    setup_npt_range(pml4e, 0, end_of_memory,
+            NPT_PRESENT | NPT_RW_ACCESS);
+}
+
 static void setup_svm(void)
 {
 	void *hsave = alloc_page();
-	u64 *page, address;
-	int i,j;
+    int i;
 
 	wrmsr(MSR_VM_HSAVE_PA, virt_to_phys(hsave));
 	wrmsr(MSR_EFER, rdmsr(MSR_EFER) | EFER_SVME);
@@ -335,37 +426,7 @@  static void setup_svm(void)
 	* Build the page-table bottom-up and map everything with 4k
 	* pages to get enough granularity for the NPT unit-tests.
 	*/
-
-	address = 0;
-
-	/* PTE level */
-	for (i = 0; i < 2048; ++i) {
-		page = alloc_page();
-
-		for (j = 0; j < 512; ++j, address += 4096)
-	    		page[j] = address | 0x067ULL;
-
-		pte[i] = page;
-	}
-
-	/* PDE level */
-	for (i = 0; i < 4; ++i) {
-		page = alloc_page();
-
-	for (j = 0; j < 512; ++j)
-	    page[j] = (u64)pte[(i * 512) + j] | 0x027ULL;
-
-		pde[i] = page;
-	}
-
-	/* PDPe level */
-	pdpe   = alloc_page();
-	for (i = 0; i < 4; ++i)
-		pdpe[i] = ((u64)(pde[i])) | 0x27;
-
-	/* PML4e level */
-	pml4e    = alloc_page();
-	pml4e[0] = ((u64)pdpe) | 0x27;
+    setup_npt();
 }
 
 int matched;
diff --git a/x86/svm.h b/x86/svm.h
index 9ab3aa5..7815f56 100644
--- a/x86/svm.h
+++ b/x86/svm.h
@@ -147,6 +147,17 @@  struct __attribute__ ((__packed__)) vmcb_control_area {
 #define SVM_VM_CR_SVM_LOCK_MASK 0x0008ULL
 #define SVM_VM_CR_SVM_DIS_MASK  0x0010ULL
 
+#define NPT_PAGE_LEVEL      4
+#define NPT_PGDIR_WIDTH     9
+#define NPT_PGDIR_MASK      511
+
+#define NPT_PRESENT         (1ul << 0)
+#define NPT_RW_ACCESS       (1ul << 1)
+#define NPT_USER_ACCESS     (1ul << 2)
+#define NPT_ACCESS_BIT      (1ul << 5)
+#define NPT_DIRTY_BIT       (1ul << 6)
+#define NPT_NX_ACCESS       (1ul << 63)
+
 struct __attribute__ ((__packed__)) vmcb_seg {
 	u16 selector;
 	u16 attrib;
@@ -401,7 +412,7 @@  typedef void (*test_guest_func)(struct svm_test *);
 int run_svm_tests(int ac, char **av);
 u64 *npt_get_pte(u64 address);
 u64 *npt_get_pde(u64 address);
-u64 *npt_get_pdpe(void);
+u64 *npt_get_pdpe(u64 address);
 u64 *npt_get_pml4e(void);
 bool smp_supported(void);
 bool default_supported(void);
@@ -418,7 +429,11 @@  struct regs get_regs(void);
 void vmmcall(void);
 int __svm_vmrun(u64 rip);
 int svm_vmrun(void);
+void setup_npt(void);
 void test_set_guest(test_guest_func func);
+void install_npt(u64 *pml4, u64 phys, u64 guest_addr, u64 perm);
+u64* get_npt_pte(u64 *pml4, u64 guest_addr, int level);
+void set_npt_pte(u64 *pml4, u64 guest_addr, int level, u64  pte_val);
 
 extern struct vmcb *vmcb;
 extern struct svm_test svm_tests[];
diff --git a/x86/svm_npt.c b/x86/svm_npt.c
index 4f80d9a..4f95ae0 100644
--- a/x86/svm_npt.c
+++ b/x86/svm_npt.c
@@ -208,7 +208,7 @@  static void __svm_npt_rsvd_bits_test(u64 *pxe, u64 rsvd_bits, u64 efer,
     report(exit_reason == SVM_EXIT_NPF,
            "Wanted #NPF on rsvd bits = 0x%lx, got exit = 0x%x", rsvd_bits, exit_reason);
 
-    if (pxe == npt_get_pdpe() || pxe == npt_get_pml4e()) {
+    if (pxe == npt_get_pdpe((u64)basic_guest_main) || pxe == npt_get_pml4e()) {
         /*
          * The guest's page tables will blow up on a bad PDPE/PML4E,
          * before starting the final walk of the guest page.
@@ -336,7 +336,7 @@  skip_pte_test:
                 get_random_bits(20, 13) | PT_PAGE_SIZE_MASK,
                 host_efer, host_cr4, guest_efer, guest_cr4);
 
-    _svm_npt_rsvd_bits_test(npt_get_pdpe(),
+    _svm_npt_rsvd_bits_test(npt_get_pdpe((u64)basic_guest_main),
                 PT_PAGE_SIZE_MASK |
                     (this_cpu_has(X86_FEATURE_GBPAGES) ? get_random_bits(29, 13) : 0),
                 host_efer, host_cr4, guest_efer, guest_cr4);