
[RFC,4/5] kvm-ept-idle: EPT page table walk for A bits

Message ID 20180901124811.644382292@intel.com (mailing list archive)
State New, archived
Headers show
Series introduce /proc/PID/idle_bitmap | expand

Commit Message

Fengguang Wu Sept. 1, 2018, 11:28 a.m. UTC
This borrows the host page table walk macros/functions to do the EPT walk,
so it depends on the EPT and host page tables using the same number of levels.

Dave Hansen raised the concern that the hottest pages may be cached in the TLB
and so may not get their accessed bits set frequently. The solution would be to
invalidate the TLB for the mm being walked once a round of scanning is finished.

Warning: read() also clears the accessed bit, in order to avoid one more
page table walk for write(). That may not be desirable for some use cases, so
we could avoid clearing the accessed bit when the file is opened read-only.
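
For illustration, here is a minimal user space sketch under that proposed
semantics (assuming the /proc/PID/idle_bitmap interface from this series, and
that an O_RDONLY open would only sample the accessed bits while O_RDWR would
also clear them; error handling trimmed):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		uint64_t chunk;
		char path[64];
		int fd;

		if (argc < 2)
			return 1;
		snprintf(path, sizeof(path), "/proc/%s/idle_bitmap", argv[1]);

		/* O_RDWR: report idle bits and clear accessed bits for a new round */
		fd = open(path, O_RDWR);
		if (fd < 0)
			return 1;

		/* each u64 chunk covers 64 * PAGE_SIZE of virtual address space */
		while (read(fd, &chunk, sizeof(chunk)) == sizeof(chunk))
			printf("%016llx\n", (unsigned long long)chunk);

		close(fd);
		return 0;
	}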

The interface should be further improved to

1) report holes and huge pages in one go
2) represent huge pages and sparse page tables efficiently

(1) can be trivially addressed by extending the bitmap to use more bits per PAGE_SIZE.

(2) would need fundamental changes to the interface. It seems existing solutions
for sparse files, like SEEK_HOLE/SEEK_DATA and the FIEMAP ioctl, may not serve this
situation well. The most efficient way could be to fill the user space read()
buffer with an array of small extents:

	struct idle_extent {
		unsigned type :  4; 
		unsigned nr   :  4; 
	};

where type can be one of

	4K_HOLE
	4K_IDLE
	4K_ACCESSED
	2M_HOLE
	2M_IDLE
	2M_ACCESSED
	1G_OR_LARGER_PAGE
	...

There can be up to 16 types, so more page sizes can be defined. The above names
are just for easy understanding of the typical case. It's also possible that
PAGE_SIZE is not 4K, or that the PMD level maps 4M pages; in that case we can
change the type names to more suitable ones like PTE_HOLE and PMD_ACCESSED.
Since this is page table walking, user space had better know the exact page
sizes: both the accessed bit and page migration are tied to the real page size.
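
To make the discussion concrete, a minimal user space sketch for consuming
such an extent stream could look like this (the type values are hypothetical
and only mirror the names above; nothing below is part of the current patch):

	#include <stddef.h>
	#include <stdio.h>

	enum idle_type {
		IDLE_4K_HOLE,
		IDLE_4K_IDLE,
		IDLE_4K_ACCESSED,
		IDLE_2M_HOLE,
		IDLE_2M_IDLE,
		IDLE_2M_ACCESSED,
		/* ... up to 16 types */
	};

	struct idle_extent {
		unsigned type : 4;
		unsigned nr   : 4;
	};

	/* Walk one read() buffer of extents, tracking the covered virtual address. */
	static unsigned long walk_extents(const struct idle_extent *ext,
					  size_t count, unsigned long va)
	{
		size_t i;

		for (i = 0; i < count; i++) {
			/* the extent granularity depends on the type's page size */
			unsigned long size = ext[i].type < IDLE_2M_HOLE ?
					     (4UL << 10) : (2UL << 20);

			if (ext[i].type == IDLE_4K_IDLE ||
			    ext[i].type == IDLE_2M_IDLE)
				printf("idle range: %#lx +%lu\n",
				       va, size * ext[i].nr);

			va += size * ext[i].nr;
		}

		return va;
	}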

Anyone interested in adding PTE_DIRTY or more types?

The main problem with such an extent reporting interface is that the number of
bytes returned by read() (variable extents) will not match how far the file
position advances (fixed VA indexes), which is not POSIX compliant. Simple
cp/cat may still work, as they don't lseek() based on the read() return value.
If that's really a concern, we could use ioctl() instead.
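
For example, a hypothetical ioctl (every name and number below is made up for
illustration, none of it exists in this series) could report separately how
many extents were filled and how far the scan advanced, so neither is coupled
to the file position:

	/* illustration only: hypothetical ioctl argument */
	struct idle_extent_query {
		__u64 start;       /* in: first VA to scan; out: VA after the last extent */
		__u64 extents_ptr; /* in: user buffer of struct idle_extent */
		__u32 nr_extents;  /* in: buffer capacity; out: extents filled */
		__u32 flags;
	};

	#define IDLE_GET_EXTENTS	_IOWR('i', 0x01, struct idle_extent_query)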

CC: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 arch/x86/kvm/ept_idle.c | 211 ++++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/ept_idle.h |  55 +++++++++++++
 2 files changed, 266 insertions(+)

Comments

Dave Hansen Sept. 6, 2018, 2:35 p.m. UTC | #1
On 09/01/2018 04:28 AM, Fengguang Wu wrote:
> (2) would need fundamental changes to the interface. It seems existing solutions
> for sparse files, like SEEK_HOLE/SEEK_DATA and the FIEMAP ioctl, may not serve this
> situation well. The most efficient way could be to fill the user space read()
> buffer with an array of small extents:

I've only been doing kernel development a few short years, but I've
learned that designing user/kernel interfaces is hard.

A comment in an RFC saying that we need "fundamental changes to the
interface" seems to be more of a cry for help than a request for
comment.  This basically says to me: ignore the interface, it's broken.

> This borrows the host page table walk macros/functions to do the EPT walk,
> so it depends on the EPT and host page tables using the same number of levels.

Have you actually run this code?

How does this work?  It's walking the 'ept_root' that appears to be a
guest CR3 register value.  It doesn't appear to be the host CR3 value of
the qemu (or other hypervisor).

I'm also woefully confused why you are calling these EPT walks and then
walking the x86-style page tables.  EPT tables don't have the same
format as x86 page tables, plus they don't start at a CR3-provided value.

I'm also rather unsure that when running a VM, *any* host-format page
tables get updated A/D bits.  You need a host vaddr to do a host-format
page table walk in the host page tables, and the EPT tables do direct
guest physical to host physical translation.  There's no host vaddr
involved at all in those translations.

> +		if (!ept_pte_present(*pte) ||
> +		    !ept_pte_accessed(*pte))
> +			idle = 1;

Huh?  So, !Present and idle are treated the same?  If you had large
swaths of !Present memory, you would see that in this interface and say,
"gee, I've got a lot of idle memory to migrate" and then go do a bunch
of calls to migrate it?  That seems terminally wasteful.

Who is going to use this?

Do you have an example, at least dummy app?

Can more than one app read this data at the same time?  Who manages it?
Who owns it?  Are multiple reads destructive?

This entire set is almost entirely comment-free except for the
occasional C++ comments.  That doesn't inspire confidence.

Patch

diff --git a/arch/x86/kvm/ept_idle.c b/arch/x86/kvm/ept_idle.c
index 5b97dd01011b..8a233ab8656d 100644
--- a/arch/x86/kvm/ept_idle.c
+++ b/arch/x86/kvm/ept_idle.c
@@ -9,6 +9,217 @@ 
 
 #include "ept_idle.h"
 
+static int add_to_idle_bitmap(struct ept_idle_ctrl *eic,
+			      int idle, unsigned long addr_range)
+{
+	int nbits = addr_range >> PAGE_SHIFT;
+	int bits_left = EPT_IDLE_KBUF_BITS - eic->bits_read;
+	int ret = 0;
+
+	if (nbits >= bits_left) {
+		ret = EPT_IDLE_KBUF_FULL;
+		nbits = bits_left;
+	}
+
+	// TODO: this assumes u64 == unsigned long
+	if (!idle)
+		__bitmap_clear((unsigned long *)eic->kbuf, eic->bits_read, nbits);
+	eic->bits_read += nbits;
+
+	return ret;
+}
+
+static int ept_pte_range(struct ept_idle_ctrl *eic,
+			 pmd_t *pmd, unsigned long addr, unsigned long end)
+{
+	pte_t *pte;
+	int err = 0;
+	int idle;
+
+	pte = pte_offset_kernel(pmd, addr);
+	do {
+		if (!ept_pte_present(*pte) ||
+		    !ept_pte_accessed(*pte))
+			idle = 1;
+		else {
+			idle = 0;
+			*pte = pte_clear_flags(*pte, _PAGE_EPT_ACCESSED);
+		}
+
+		err = add_to_idle_bitmap(eic, idle, PAGE_SIZE);
+		if (err)
+			break;
+	} while (pte++, addr += PAGE_SIZE, addr != end);
+
+	return err;
+}
+
+static int ept_pmd_range(struct ept_idle_ctrl *eic,
+			 pud_t *pud, unsigned long addr, unsigned long end)
+{
+	pmd_t *pmd;
+	unsigned long next;
+	int err = 0;
+	int idle;
+
+	pmd = pmd_offset(pud, addr);
+	do {
+		next = pmd_addr_end(addr, end);
+		idle = -1;
+		if (!ept_pmd_present(*pmd) ||
+		    !ept_pmd_accessed(*pmd)) {
+			idle = 1;
+		} else if (pmd_large(*pmd)) {
+			idle = 0;
+			*pmd = pmd_clear_flags(*pmd, _PAGE_EPT_ACCESSED);
+		}
+		if (idle >= 0)
+			err = add_to_idle_bitmap(eic, idle, next - addr);
+		else
+			err = ept_pte_range(eic, pmd, addr, next);
+		if (err)
+			break;
+	} while (pmd++, addr = next, addr != end);
+
+	return err;
+}
+
+static int ept_pud_range(struct ept_idle_ctrl *eic,
+			 p4d_t *p4d, unsigned long addr, unsigned long end)
+{
+	pud_t *pud;
+	unsigned long next;
+	int err = 0;
+	int idle;
+
+	pud = pud_offset(p4d, addr);
+	do {
+		next = pud_addr_end(addr, end);
+		idle = -1;
+		if (!ept_pud_present(*pud) ||
+		    !ept_pud_accessed(*pud)) {
+			idle = 1;
+		} else if (pud_large(*pud)) {
+			idle = 0;
+			*pud = pud_clear_flags(*pud, _PAGE_EPT_ACCESSED);
+		}
+		if (idle >= 0)
+			err = add_to_idle_bitmap(eic, idle, next - addr);
+		else
+			err = ept_pmd_range(eic, pud, addr, next);
+		if (err)
+			break;
+	} while (pud++, addr = next, addr != end);
+
+	return err;
+}
+
+static int ept_p4d_range(struct ept_idle_ctrl *eic,
+			 pgd_t *pgd, unsigned long addr, unsigned long end)
+{
+	p4d_t *p4d;
+	unsigned long next;
+	int err = 0;
+
+	p4d = p4d_offset(pgd, addr);
+	do {
+		next = p4d_addr_end(addr, end);
+		if (!ept_p4d_present(*p4d))
+			err = add_to_idle_bitmap(eic, 1, next - addr);
+		else
+			err = ept_pud_range(eic, p4d, addr, next);
+		if (err)
+			break;
+	} while (p4d++, addr = next, addr != end);
+
+	return err;
+}
+
+static int ept_page_range(struct ept_idle_ctrl *eic,
+			  pgd_t *ept_root, unsigned long addr, unsigned long end)
+{
+	pgd_t *pgd;
+	unsigned long next;
+	int err = 0;
+
+	BUG_ON(addr >= end);
+	pgd = pgd_offset_pgd(ept_root, addr);
+	do {
+		next = pgd_addr_end(addr, end);
+		if (!ept_pgd_present(*pgd))
+			err = add_to_idle_bitmap(eic, 1, next - addr);
+		else
+			err = ept_p4d_range(eic, pgd, addr, next);
+		if (err)
+			break;
+	} while (pgd++, addr = next, addr != end);
+
+	return err;
+}
+
+static void init_ept_idle_ctrl_buffer(struct ept_idle_ctrl *eic)
+{
+	eic->bits_read = 0;
+	memset(eic->kbuf, 0xff, EPT_IDLE_KBUF_BYTES);
+}
+
+static int ept_idle_walk_gfn_range(struct ept_idle_ctrl *eic,
+				   unsigned long gfn_start,
+				   unsigned long gfn_end)
+{
+	struct kvm_vcpu *vcpu = kvm_get_vcpu(eic->kvm, 0);
+	struct kvm_mmu *mmu = &vcpu->arch.mmu;
+	pgd_t *ept_root;
+	unsigned long gpa_start = gfn_to_gpa(gfn_start);
+	unsigned long gpa_end = gfn_to_gpa(gfn_end);
+	unsigned long gpa_addr;
+	int bytes_read;
+	int ret = -EINVAL;
+
+	init_ept_idle_ctrl_buffer(eic);
+
+	spin_lock(&eic->kvm->mmu_lock);
+	if (mmu->base_role.ad_disabled) {
+		printk(KERN_NOTICE "CPU does not support EPT A/D bits tracking\n");
+		goto out_unlock;
+	}
+
+	if (mmu->shadow_root_level != 4 + (!!pgtable_l5_enabled())) {
+		printk(KERN_NOTICE "Unsupported EPT level %d\n", mmu->shadow_root_level);
+		goto out_unlock;
+	}
+
+	if (!VALID_PAGE(mmu->root_hpa)) {
+		goto out_unlock;
+	}
+
+	ept_root = __va(mmu->root_hpa);
+
+	for (gpa_addr = gpa_start;
+	     gpa_addr < gpa_end;
+	     gpa_addr += EPT_IDLE_KBUF_BITS << PAGE_SHIFT) {
+		ept_page_range(eic, ept_root, gpa_addr, gpa_end);
+		spin_unlock(&eic->kvm->mmu_lock);
+
+		bytes_read = eic->bits_read / 8;
+
+		ret = copy_to_user(eic->buf, eic->kbuf, bytes_read);
+		if (ret)
+			return -EFAULT;
+
+		eic->buf += bytes_read;
+		eic->bytes_copied += bytes_read;
+		if (eic->bytes_copied >= eic->buf_size)
+			return 0;
+
+		init_ept_idle_ctrl_buffer(eic);
+		cond_resched();
+		spin_lock(&eic->kvm->mmu_lock);
+	}
+out_unlock:
+	spin_unlock(&eic->kvm->mmu_lock);
+	return ret;
+}
 
 // mindless copy from kvm_handle_hva_range().
 // TODO: handle order and hole.
diff --git a/arch/x86/kvm/ept_idle.h b/arch/x86/kvm/ept_idle.h
index e0b9dcecf50b..b715ec7b7513 100644
--- a/arch/x86/kvm/ept_idle.h
+++ b/arch/x86/kvm/ept_idle.h
@@ -1,6 +1,61 @@ 
 #ifndef _EPT_IDLE_H
 #define _EPT_IDLE_H
 
+#define _PAGE_BIT_EPT_ACCESSED 8
+#define _PAGE_EPT_ACCESSED	(_AT(pteval_t, 1) << _PAGE_BIT_EPT_ACCESSED)
+
+#define _PAGE_EPT_PRESENT	(_AT(pteval_t, 7))
+
+static inline int ept_pte_present(pte_t a)
+{
+	return pte_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_pmd_present(pmd_t a)
+{
+	return pmd_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_pud_present(pud_t a)
+{
+	return pud_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_p4d_present(p4d_t a)
+{
+	return p4d_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_pgd_present(pgd_t a)
+{
+	return pgd_flags(a) & _PAGE_EPT_PRESENT;
+}
+
+static inline int ept_pte_accessed(pte_t a)
+{
+	return pte_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+static inline int ept_pmd_accessed(pmd_t a)
+{
+	return pmd_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+static inline int ept_pud_accessed(pud_t a)
+{
+	return pud_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+static inline int ept_p4d_accessed(p4d_t a)
+{
+	return p4d_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
+static inline int ept_pgd_accessed(pgd_t a)
+{
+	return pgd_flags(a) & _PAGE_EPT_ACCESSED;
+}
+
 #define IDLE_BITMAP_CHUNK_SIZE	sizeof(u64)
 #define IDLE_BITMAP_CHUNK_BITS	(IDLE_BITMAP_CHUNK_SIZE * BITS_PER_BYTE)