
[1/6] KVM: Cache the least recently used slot index per vCPU

Message ID 20210730223707.4083785-2-dmatlack@google.com (mailing list archive)
State New, archived
Series Improve gfn-to-memslot performance during page faults

Commit Message

David Matlack July 30, 2021, 10:37 p.m. UTC
The memslot for a given gfn is looked up multiple times during page
fault handling. Avoid binary searching for it multiple times by caching
the least recently used slot. There is an existing VM-wide LRU slot but
that does not work well for cases where vCPUs are accessing memory in
different slots (see performance data below).

Another benefit of caching the least recently used slot (versus looking
up the slot once and passing around a pointer) is speeding up memslot
lookups *across* faults and during spte prefetching.
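
Condensed from the diff below for quick reference, the vCPU-level lookup
this adds is a check-the-cache-then-fall-back path (all helper names are
the ones introduced by this patch):

struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)
{
        struct kvm_memslots *slots = kvm_vcpu_memslots(vcpu);
        int slot_index = vcpu->lru_slot_index;  /* per-vCPU cached index */
        struct kvm_memory_slot *slot;

        /* Cache miss: fall back to the binary search. */
        if (!slot_contains_gfn(slots, slot_index, gfn))
                slot_index = __search_memslots(slots, gfn);

        slot = get_slot(slots, slot_index);

        /* Only remember indices that actually resolved to a slot. */
        if (slot)
                vcpu->lru_slot_index = slot_index;

        return slot;
}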

To measure the performance of this change I ran dirty_log_perf_test with
64 vCPUs and 64 memslots and measured "Populate memory time" and
"Iteration 2 dirty memory time".  Tests were ran with eptad=N to force
dirty logging to use fast_page_fault so its performance could be
measured.

Config     | Metric                        | Before | After
---------- | ----------------------------- | ------ | ------
tdp_mmu=Y  | Populate memory time          | 6.76s  | 5.47s
tdp_mmu=Y  | Iteration 2 dirty memory time | 2.83s  | 0.31s
tdp_mmu=N  | Populate memory time          | 20.4s  | 18.7s
tdp_mmu=N  | Iteration 2 dirty memory time | 2.65s  | 0.30s

The "Iteration 2 dirty memory time" results are especially compelling
because they are equivalent to running the same test with a single
memslot. In other words, fast_page_fault performance no longer scales
with the number of memslots.

Signed-off-by: David Matlack <dmatlack@google.com>
---
 include/linux/kvm_host.h | 61 ++++++++++++++++++++++++++++++----------
 virt/kvm/kvm_main.c      | 21 +++++++++++++-
 2 files changed, 66 insertions(+), 16 deletions(-)

Comments

Paolo Bonzini Aug. 2, 2021, 2:36 p.m. UTC | #1
On 31/07/21 00:37, David Matlack wrote:
> The memslot for a given gfn is looked up multiple times during page
> fault handling. Avoid binary searching for it multiple times by caching
> the least recently used slot. There is an existing VM-wide LRU slot but
> that does not work well for cases where vCPUs are accessing memory in
> different slots (see performance data below).
> 
> Another benefit of caching the least recently used slot (versus looking
> up the slot once and passing around a pointer) is speeding up memslot
> lookups *across* faults and during spte prefetching.
> 
> To measure the performance of this change I ran dirty_log_perf_test with
> 64 vCPUs and 64 memslots and measured "Populate memory time" and
> "Iteration 2 dirty memory time".  Tests were ran with eptad=N to force
> dirty logging to use fast_page_fault so its performance could be
> measured.
> 
> Config     | Metric                        | Before | After
> ---------- | ----------------------------- | ------ | ------
> tdp_mmu=Y  | Populate memory time          | 6.76s  | 5.47s
> tdp_mmu=Y  | Iteration 2 dirty memory time | 2.83s  | 0.31s
> tdp_mmu=N  | Populate memory time          | 20.4s  | 18.7s
> tdp_mmu=N  | Iteration 2 dirty memory time | 2.65s  | 0.30s
> 
> The "Iteration 2 dirty memory time" results are especially compelling
> because they are equivalent to running the same test with a single
> memslot. In other words, fast_page_fault performance no longer scales
> with the number of memslots.
> 
> Signed-off-by: David Matlack <dmatlack@google.com>

It's the *most* recently used slot index, of course. :)  That's true of 
lru_slot as well.

> +static inline struct kvm_memory_slot *get_slot(struct kvm_memslots *slots, int slot_index)
> +{
> +	if (slot_index < 0 || slot_index >= slots->used_slots)
> +		return NULL;
> +
> +	return &slots->memslots[slot_index];
> +}
> +

Since there are plenty of arrays inside struct kvm_memory_slot*, do we 
want to protect this against speculative out-of-bounds accesses with 
array_index_nospec?
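
For reference, a minimal sketch of that hardening, keeping the existing
bounds check (array_index_nospec() comes from <linux/nospec.h>; this is
only an illustration, not a final form):

static inline struct kvm_memory_slot *get_slot(struct kvm_memslots *slots, int slot_index)
{
        if (slot_index < 0 || slot_index >= slots->used_slots)
                return NULL;

        /*
         * Clamp the index under speculation so a mispredicted bounds
         * check cannot be used to read past memslots[].
         */
        slot_index = array_index_nospec(slot_index, slots->used_slots);
        return &slots->memslots[slot_index];
}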

> +static inline struct kvm_memory_slot *
> +search_memslots(struct kvm_memslots *slots, gfn_t gfn)
> +{
> +	int slot_index = __search_memslots(slots, gfn);
> +
> +	return get_slot(slots, slot_index);
>  }

Let's use this occasion to remove the duplication between 
__gfn_to_memslot and search_memslots; you can make search_memslots do 
the search and place the LRU (ehm, MRU) code in __gfn_to_memslot only.  So:

- the new patch 1 (something like "make search_memslots search without 
LRU caching") is basically this series's patch 2, plus a tree-wide 
replacement of search_memslots with __gfn_to_memslot.

- the new patch 2 is this series's patch 1, except 
kvm_vcpu_gfn_to_memslot uses search_memslots instead of 
__search_memslots.  The comments in patch 2's commit message about the 
double misses move to this commit message.
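
A rough sketch of the shape that split could take, reusing the get_slot()
and slot_contains_gfn() helpers from this patch (the actual v2 may look
different):

/*
 * Pure binary search, no MRU caching. Slots are sorted from highest GFN
 * to lowest GFN.
 */
static inline struct kvm_memory_slot *
search_memslots(struct kvm_memslots *slots, gfn_t gfn)
{
        int start = 0, end = slots->used_slots;
        struct kvm_memory_slot *memslots = slots->memslots;
        int slot;

        if (unlikely(!slots->used_slots))
                return NULL;

        while (start < end) {
                slot = start + (end - start) / 2;

                if (gfn >= memslots[slot].base_gfn)
                        end = slot;
                else
                        start = slot + 1;
        }

        if (slot_contains_gfn(slots, start, gfn))
                return &memslots[start];

        return NULL;
}

/* Only __gfn_to_memslot() reads and updates the VM-wide MRU cache. */
static inline struct kvm_memory_slot *
__gfn_to_memslot(struct kvm_memslots *slots, gfn_t gfn)
{
        int slot = atomic_read(&slots->lru_slot);
        struct kvm_memory_slot *memslot;

        if (slot_contains_gfn(slots, slot, gfn))
                return get_slot(slots, slot);

        memslot = search_memslots(slots, gfn);
        if (memslot)
                atomic_set(&slots->lru_slot,
                           (int)(memslot - slots->memslots));

        return memslot;
}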

> +	if (slot)
> +		vcpu->lru_slot_index = slot_index;

Let's call it lru_slot for consistency with the field of struct 
kvm_memslots.

Paolo
David Matlack Aug. 2, 2021, 4:27 p.m. UTC | #2
On Mon, Aug 02, 2021 at 04:36:24PM +0200, Paolo Bonzini wrote:
> On 31/07/21 00:37, David Matlack wrote:
> > [...]
> 
> It's the *most* recently used slot index, of course. :)  That's true of
> lru_slot as well.

*facepalm*

I'll include a patch in v2 to fix the name of the existing lru_slot to
something like mru_slot or last_used_slot.

> 
> > +static inline struct kvm_memory_slot *get_slot(struct kvm_memslots *slots, int slot_index)
> > +{
> > +	if (slot_index < 0 || slot_index >= slots->used_slots)
> > +		return NULL;
> > +
> > +	return &slots->memslots[slot_index];
> > +}
> > +
> 
> Since there are plenty of arrays inside struct kvm_memory_slot*, do we want
> to protect this against speculative out-of-bounds accesses with
> array_index_nospec?

I'm not sure when it makes sense to use array_index_nospec. Perhaps this
is a good candidate since vcpu->lru_slot_index and the length of
memslots[] can both be controlled by userspace.

I'll play around with adding it to see if there are any performance
trade-offs.

> 
> > +static inline struct kvm_memory_slot *
> > +search_memslots(struct kvm_memslots *slots, gfn_t gfn)
> > +{
> > +	int slot_index = __search_memslots(slots, gfn);
> > +
> > +	return get_slot(slots, slot_index);
> >  }
> 
> Let's use this occasion to remove the duplication between __gfn_to_memslot
> and search_memslots; you can make search_memslots do the search and place
> the LRU (ehm, MRU) code in __gfn_to_memslot only.  So:
> 
> - the new patch 1 (something like "make search_memslots search without LRU
> caching") is basically this series's patch 2, plus a tree-wide replacement
> of search_memslots with __gfn_to_memslot.
> 
> - the new patch 2 is this series's patch 1, except kvm_vcpu_gfn_to_memslot
> uses search_memslots instead of __search_memslots.  The comments in patch
> 2's commit message about the double misses move to this commit message.

Will do.

> 
> > +	if (slot)
> > +		vcpu->lru_slot_index = slot_index;
> 
> Let's call it lru_slot for consistency with the field of struct
> kvm_memslots.

I'll make sure the two have consistent names when I rename them to get
rid of "lru".

> 
> Paolo
>
Sean Christopherson Aug. 2, 2021, 4:38 p.m. UTC | #3
On Mon, Aug 02, 2021, David Matlack wrote:
> I'll include a patch in v2 to fix the name of the existing lru_slot to
> something like mru_slot or last_used_slot.

I vote for last_used_slot, "mru" isn't exactly a common acronym, and the "mr" part
could easily be misinterpreted as Memory Region.

Patch

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d6b4ad407b8..320090d5a124 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -354,6 +354,13 @@  struct kvm_vcpu {
 	struct kvm_vcpu_stat stat;
 	char stats_id[KVM_STATS_NAME_SIZE];
 	struct kvm_dirty_ring dirty_ring;
+
+	/*
+	 * The index of the least recently used memslot by this vCPU. It's ok
+	 * if this becomes stale due to memslot changes since we always check
+	 * it is a valid slot.
+	 */
+	int lru_slot_index;
 };
 
 /* must be called with irqs disabled */
@@ -1189,27 +1196,38 @@  int kvm_request_irq_source_id(struct kvm *kvm);
 void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
 bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
 
+static inline struct kvm_memory_slot *get_slot(struct kvm_memslots *slots, int slot_index)
+{
+	if (slot_index < 0 || slot_index >= slots->used_slots)
+		return NULL;
+
+	return &slots->memslots[slot_index];
+}
+
+static inline bool slot_contains_gfn(struct kvm_memslots *slots, int slot_index, gfn_t gfn)
+{
+	struct kvm_memory_slot *memslot = get_slot(slots, slot_index);
+
+	if (!memslot)
+		return false;
+
+	return gfn >= memslot->base_gfn && gfn < memslot->base_gfn + memslot->npages;
+}
+
 /*
- * search_memslots() and __gfn_to_memslot() are here because they are
- * used in non-modular code in arch/powerpc/kvm/book3s_hv_rm_mmu.c.
- * gfn_to_memslot() itself isn't here as an inline because that would
- * bloat other code too much.
- *
  * IMPORTANT: Slots are sorted from highest GFN to lowest GFN!
  */
-static inline struct kvm_memory_slot *
-search_memslots(struct kvm_memslots *slots, gfn_t gfn)
+static inline int __search_memslots(struct kvm_memslots *slots, gfn_t gfn)
 {
 	int start = 0, end = slots->used_slots;
 	int slot = atomic_read(&slots->lru_slot);
 	struct kvm_memory_slot *memslots = slots->memslots;
 
 	if (unlikely(!slots->used_slots))
-		return NULL;
+		return -1;
 
-	if (gfn >= memslots[slot].base_gfn &&
-	    gfn < memslots[slot].base_gfn + memslots[slot].npages)
-		return &memslots[slot];
+	if (slot_contains_gfn(slots, slot, gfn))
+		return slot;
 
 	while (start < end) {
 		slot = start + (end - start) / 2;
@@ -1220,13 +1238,26 @@  search_memslots(struct kvm_memslots *slots, gfn_t gfn)
 			start = slot + 1;
 	}
 
-	if (start < slots->used_slots && gfn >= memslots[start].base_gfn &&
-	    gfn < memslots[start].base_gfn + memslots[start].npages) {
+	if (slot_contains_gfn(slots, start, gfn)) {
 		atomic_set(&slots->lru_slot, start);
-		return &memslots[start];
+		return start;
 	}
 
-	return NULL;
+	return -1;
+}
+
+/*
+ * search_memslots() and __gfn_to_memslot() are here because they are
+ * used in non-modular code in arch/powerpc/kvm/book3s_hv_rm_mmu.c.
+ * gfn_to_memslot() itself isn't here as an inline because that would
+ * bloat other code too much.
+ */
+static inline struct kvm_memory_slot *
+search_memslots(struct kvm_memslots *slots, gfn_t gfn)
+{
+	int slot_index = __search_memslots(slots, gfn);
+
+	return get_slot(slots, slot_index);
 }
 
 static inline struct kvm_memory_slot *
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a96cbe24c688..9307594bda0c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -415,6 +415,7 @@  static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->preempted = false;
 	vcpu->ready = false;
 	preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
+	vcpu->lru_slot_index = 0;
 }
 
 void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
@@ -2024,7 +2025,25 @@  EXPORT_SYMBOL_GPL(gfn_to_memslot);
 
 struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn)
 {
-	return __gfn_to_memslot(kvm_vcpu_memslots(vcpu), gfn);
+	struct kvm_memslots *slots = kvm_vcpu_memslots(vcpu);
+	int slot_index = vcpu->lru_slot_index;
+	struct kvm_memory_slot *slot;
+
+	if (!slot_contains_gfn(slots, slot_index, gfn))
+		slot_index = __search_memslots(slots, gfn);
+
+	slot = get_slot(slots, slot_index);
+
+	/*
+	 * Purposely avoid updating vcpu->lru_slot_index if the gfn is not
+	 * backed by memslot as that will guarantee a cache miss on the next
+	 * try. By leaving vcpu->lru_slot_index untouched we have a chance of
+	 * a hit on the next lookup.
+	 */
+	if (slot)
+		vcpu->lru_slot_index = slot_index;
+
+	return slot;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);