[v3,kvm/queue,05/16] KVM: Maintain ofs_tree for fast memslot lookup by file offset

Message ID 20211223123011.41044-6-chao.p.peng@linux.intel.com (mailing list archive)
State New, archived
Series KVM: mm: fd-based approach for supporting KVM guest private memory

Commit Message

Chao Peng Dec. 23, 2021, 12:30 p.m. UTC
Similar to hva_tree for hva ranges, maintain an interval tree ofs_tree
for the offset range of a fd-based memslot so that lookup by offset
range can be faster when the memslot count is high.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
---
 include/linux/kvm_host.h |  2 ++
 virt/kvm/kvm_main.c      | 17 +++++++++++++----
 2 files changed, 15 insertions(+), 4 deletions(-)
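
The payoff of ofs_tree is an O(log n) lookup of the fd-based memslot(s)
intersecting a file-offset range. A minimal sketch of what a consumer of
the tree could look like (this helper is not part of the patch; the name
is invented for illustration):

static struct kvm_memory_slot *
kvm_first_memslot_by_ofs(struct kvm_memslots *slots,
			 unsigned long start, unsigned long last)
{
	struct interval_tree_node *node;

	/* Find the first slot whose offset range intersects [start, last]. */
	node = interval_tree_iter_first(&slots->ofs_tree, start, last);
	if (!node)
		return NULL;

	return container_of(node, struct kvm_memory_slot,
			    ofs_node[slots->node_idx]);
}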

Comments

Sean Christopherson Dec. 23, 2021, 6:02 p.m. UTC | #1
On Thu, Dec 23, 2021, Chao Peng wrote:
> Similar to hva_tree for hva ranges, maintain an interval tree ofs_tree
> for the offset range of a fd-based memslot so that lookup by offset
> range can be faster when the memslot count is high.

This won't work.  The hva_tree relies on there being exactly one virtual address
space, whereas with private memory, userspace can map multiple files into the
guest at different gfns, but with overlapping offsets.

I also dislike hijacking __kvm_handle_hva_range() in patch 07.

KVM also needs to disallow mapping the same file+offset into multiple gfns, which
I don't see anywhere in this series.

In other words, there needs to be a 1:1 gfn:file+offset mapping.  Since userspace
likely wants to allocate a single file for guest private memory and map it into
multiple discontiguous slots, e.g. to skip the PCI hole, the best idea off the top
of my head would be to register the notifier on a per-slot basis, not a per-VM
basis.  It would require a 'struct kvm *' in 'struct kvm_memory_slot', but that's
not a huge deal.

That way, KVM's notifier callback already knows the memslot and can compute overlap
between the memslot and the range by reversing the math done by kvm_memfd_get_pfn().
Then, armed with the gfn and slot, invalidation is just a matter of constructing
a struct kvm_gfn_range and invoking kvm_unmap_gfn_range().
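
A rough sketch of that per-slot registration, with invented names
(memfd_register_notifier(), kvm_private_notifier_ops and slot->private_fd
are assumptions, not APIs from this series):

static int kvm_private_mem_register(struct kvm *kvm,
				    struct kvm_memory_slot *slot)
{
	/* Per-slot back-pointer, as suggested above. */
	slot->kvm = kvm;

	/*
	 * Register the slot as the opaque owner; the callback gets the
	 * slot back and can do the overlap/offset math itself.
	 */
	return memfd_register_notifier(slot->private_fd,
				       &kvm_private_notifier_ops, slot);
}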
Chao Peng Dec. 24, 2021, 3:54 a.m. UTC | #2
On Thu, Dec 23, 2021 at 06:02:33PM +0000, Sean Christopherson wrote:
> On Thu, Dec 23, 2021, Chao Peng wrote:
> > Similar to hva_tree for hva ranges, maintain an interval tree ofs_tree
> > for the offset range of a fd-based memslot so that lookup by offset
> > range can be faster when the memslot count is high.
> 
> This won't work.  The hva_tree relies on there being exactly one virtual address
> space, whereas with private memory, userspace can map multiple files into the
> guest at different gfns, but with overlapping offsets.

OK, that's a good point.

> 
> I also dislike hijacking __kvm_handle_hva_range() in patch 07.
> 
> KVM also needs to disallow mapping the same file+offset into multiple gfns, which
> I don't see anywhere in this series.

This can be checked against file+offset overlapping with existing slots
when registering a new one.
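
A minimal sketch of such a registration-time check, assuming hypothetical
fields private_file/private_offset (names invented for illustration):

static bool kvm_private_range_overlaps(struct kvm_memslots *slots,
				       struct file *file,
				       loff_t start, loff_t end)
{
	struct kvm_memory_slot *slot;
	int bkt;

	/* O(n) scan for clarity; ofs_tree could make this O(log n). */
	kvm_for_each_memslot(slot, bkt, slots) {
		loff_t slot_end = slot->private_offset +
				  ((loff_t)slot->npages << PAGE_SHIFT);

		if (slot->private_file == file &&
		    start < slot_end && end > slot->private_offset)
			return true;
	}
	return false;
}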

> 
> In other words, there needs to be a 1:1 gfn:file+offset mapping.  Since userspace
> likely wants to allocate a single file for guest private memory and map it into
> multiple discontiguous slots, e.g. to skip the PCI hole, the best idea off the top
> of my head would be to register the notifier on a per-slot basis, not a per-VM
> basis.  It would require a 'struct kvm *' in 'struct kvm_memory_slot', but that's
> not a huge deal.
> 
> That way, KVM's notifier callback already knows the memslot and can compute overlap
> between the memslot and the range by reversing the math done by kvm_memfd_get_pfn().
> Then, armed with the gfn and slot, invalidation is just a matter of constructing
> a struct kvm_gfn_range and invoking kvm_unmap_gfn_range().

The KVM side is easy, but the kernel bits would be difficult: it has to
maintain a fd+offset-to-memslot mapping because one fd can have multiple
memslots, and it needs to decide which memslot should be notified.

Thanks,
Chao
Yao Yuan Dec. 27, 2021, 11:50 p.m. UTC | #3
On Fri, Dec 24, 2021 at 11:54:18AM +0800, Chao Peng wrote:
> On Thu, Dec 23, 2021 at 06:02:33PM +0000, Sean Christopherson wrote:
> > On Thu, Dec 23, 2021, Chao Peng wrote:
> > > Similar to hva_tree for hva ranges, maintain an interval tree ofs_tree
> > > for the offset range of a fd-based memslot so that lookup by offset
> > > range can be faster when the memslot count is high.
> >
> > This won't work.  The hva_tree relies on there being exactly one virtual address
> > space, whereas with private memory, userspace can map multiple files into the
> > guest at different gfns, but with overlapping offsets.
>
> OK, that's a good point.
>
> >
> > I also dislike hijacking __kvm_handle_hva_range() in patch 07.
> >
> > KVM also needs to disallow mapping the same file+offset into multiple gfns, which
> > I don't see anywhere in this series.
>
> This can be checked against file+offset overlapping with existing slots
> when registering a new one.
>
> >
> > In other words, there needs to be a 1:1 gfn:file+offset mapping.  Since userspace
> > likely wants to allocate a single file for guest private memory and map it into
> > multiple discontiguous slots, e.g. to skip the PCI hole, the best idea off the top
> > of my head would be to register the notifier on a per-slot basis, not a per-VM
> > basis.  It would require a 'struct kvm *' in 'struct kvm_memory_slot', but that's
> > not a huge deal.
> >
> > That way, KVM's notifier callback already knows the memslot and can compute overlap
> > between the memslot and the range by reversing the math done by kvm_memfd_get_pfn().
> > Then, armed with the gfn and slot, invalidation is just a matter of constructing
> > a struct kvm_gfn_range and invoking kvm_unmap_gfn_range().
>
> The KVM side is easy, but the kernel bits would be difficult: it has to
> maintain a fd+offset-to-memslot mapping because one fd can have multiple
> memslots, and it needs to decide which memslot should be notified.

How about passing a "context" for the fd (e.g. the gfn/hva start point)
when registering the invalidation notifier with the fd? Then in the
callback KVM can convert the offset to an absolute hva/gfn with such
"context" and do the memslot invalidation.
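
A rough sketch of that idea (all names here are invented for
illustration):

struct kvm_memfd_ctx {
	struct kvm *kvm;
	gfn_t base_gfn;		/* gfn backing file offset @ofs_start */
	loff_t ofs_start;	/* slot's start offset within the fd */
};

static void kvm_memfd_invalidate(void *context, loff_t start, loff_t end)
{
	struct kvm_memfd_ctx *ctx = context;
	gfn_t gfn_start, gfn_end;

	/* Convert absolute file offsets to gfns via the stashed context. */
	gfn_start = ctx->base_gfn + ((start - ctx->ofs_start) >> PAGE_SHIFT);
	gfn_end   = ctx->base_gfn + ((end   - ctx->ofs_start) >> PAGE_SHIFT);

	kvm_zap_gfn_range(ctx->kvm, gfn_start, gfn_end);
}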

>
> Thanks,
> Chao
Sean Christopherson Dec. 28, 2021, 9:48 p.m. UTC | #4
On Fri, Dec 24, 2021, Chao Peng wrote:
> On Thu, Dec 23, 2021 at 06:02:33PM +0000, Sean Christopherson wrote:
> > On Thu, Dec 23, 2021, Chao Peng wrote:
> > > Similar to hva_tree for hva ranges, maintain an interval tree ofs_tree
> > > for the offset range of a fd-based memslot so that lookup by offset
> > > range can be faster when the memslot count is high.
> > 
> > This won't work.  The hva_tree relies on there being exactly one virtual address
> > space, whereas with private memory, userspace can map multiple files into the
> > guest at different gfns, but with overlapping offsets.
> 
> OK, that's a good point.
> 
> > 
> > I also dislike hijacking __kvm_handle_hva_range() in patch 07.
> > 
> > KVM also needs to disallow mapping the same file+offset into multiple gfns, which
> > I don't see anywhere in this series.
> 
> This can be checked against file+offset overlapping with existing slots
> when registering a new one.
> 
> > 
> > In other words, there needs to be a 1:1 gfn:file+offset mapping.  Since userspace
> > likely wants to allocate a single file for guest private memory and map it into
> > multiple discontiguous slots, e.g. to skip the PCI hole, the best idea off the top
> > of my head would be to register the notifier on a per-slot basis, not a per-VM
> > basis.  It would require a 'struct kvm *' in 'struct kvm_memory_slot', but that's
> > not a huge deal.
> > 
> > That way, KVM's notifier callback already knows the memslot and can compute overlap
> > between the memslot and the range by reversing the math done by kvm_memfd_get_pfn().
> > Then, armed with the gfn and slot, invalidation is just a matter of constructing
> > a struct kvm_gfn_range and invoking kvm_unmap_gfn_range().
> 
> The KVM side is easy, but the kernel bits would be difficult: it has to
> maintain a fd+offset-to-memslot mapping because one fd can have multiple
> memslots, and it needs to decide which memslot should be notified.

No, the kernel side maintains an opaque pointer like it does today, KVM handles
reverse engineering the memslot to get the offset and whatever else it needs.
notify_fallocate() and other callbacks are unchanged, though they probably can
drop the inode.

E.g. likely with bad math and handwaving on the overlap detection:

int kvm_private_fd_fallocate_range(void *owner, pgoff_t start, pgoff_t end)
{
	struct kvm_memory_slot *slot = owner;
	struct kvm_gfn_range gfn_range = {
		.slot	   = slot,
		.start	   = (start - slot->private_offset) >> PAGE_SHIFT,
		.end	   = (end - slot->private_offset) >> PAGE_SHIFT,
		.may_block = true,
	};

	if (!has_overlap(slot, start, end))
		return 0;

	gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);

	kvm_unmap_gfn_range(slot->kvm, &gfn_range);
	return 0;
}
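
The same sketch with the gfn math and the clamping filled in, assuming
@start/@end and slot->private_offset are byte offsets into the fd (the
units and the field names are assumptions):

static int kvm_private_fd_fallocate_range(void *owner, loff_t start, loff_t end)
{
	struct kvm_memory_slot *slot = owner;
	loff_t slot_start = slot->private_offset;
	loff_t slot_end = slot_start + ((loff_t)slot->npages << PAGE_SHIFT);
	struct kvm_gfn_range gfn_range = {
		.slot	   = slot,
		.may_block = true,
	};

	/* Nothing to do if the punched range misses this slot entirely. */
	if (end <= slot_start || start >= slot_end)
		return 0;

	/* Clamp to the slot, then convert file offsets to gfns. */
	gfn_range.start = slot->base_gfn +
			  ((max(start, slot_start) - slot_start) >> PAGE_SHIFT);
	gfn_range.end	= slot->base_gfn +
			  ((min(end, slot_end) - slot_start) >> PAGE_SHIFT);

	kvm_unmap_gfn_range(slot->kvm, &gfn_range);
	return 0;
}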
Chao Peng Dec. 31, 2021, 2:26 a.m. UTC | #5
On Tue, Dec 28, 2021 at 09:48:08PM +0000, Sean Christopherson wrote:
> On Fri, Dec 24, 2021, Chao Peng wrote:
> > On Thu, Dec 23, 2021 at 06:02:33PM +0000, Sean Christopherson wrote:
> > > On Thu, Dec 23, 2021, Chao Peng wrote:
> > > 
> > > In other words, there needs to be a 1:1 gfn:file+offset mapping.  Since userspace
> > > likely wants to allocate a single file for guest private memory and map it into
> > > multiple discontiguous slots, e.g. to skip the PCI hole, the best idea off the top
> > > of my head would be to register the notifier on a per-slot basis, not a per-VM
> > > basis.  It would require a 'struct kvm *' in 'struct kvm_memory_slot', but that's
> > > not a huge deal.
> > > 
> > > That way, KVM's notifier callback already knows the memslot and can compute overlap
> > > between the memslot and the range by reversing the math done by kvm_memfd_get_pfn().
> > > Then, armed with the gfn and slot, invalidation is just a matter of constructing
> > > a struct kvm_gfn_range and invoking kvm_unmap_gfn_range().
> > 
> > The KVM side is easy, but the kernel bits would be difficult: it has to
> > maintain a fd+offset-to-memslot mapping because one fd can have multiple
> > memslots, and it needs to decide which memslot should be notified.
> 
> No, the kernel side maintains an opaque pointer like it does today,

But the opaque pointer will now become a memslot, won't it? That said,
the kernel side should maintain a list of opaque pointers (memslots)
instead of one for each fd (inode), since the fd-to-memslot mapping is
1:M now.
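
For illustration only (names invented), that bookkeeping could be a
simple list of (ops, owner) entries hung off the fd's backing info:

struct memfd_notifier_ops {
	int (*fallocate)(void *owner, loff_t start, loff_t end);
};

struct memfd_notifier {
	struct list_head list;
	const struct memfd_notifier_ops *ops;
	void *owner;			/* opaque, e.g. a kvm_memory_slot */
};

struct memfd_backing_info {
	struct list_head notifiers;	/* one entry per registered memslot */
	struct mutex lock;		/* callbacks may block */
};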

> KVM handles
> reverse engineering the memslot to get the offset and whatever else it needs.
> notify_fallocate() and other callbacks are unchanged, though they probably can
> drop the inode.
> 
> E.g. likely with bad math and handwaving on the overlap detection:
> 
> int kvm_private_fd_fallocate_range(void *owner, pgoff_t start, pgoff_t end)
> {
> 	struct kvm_memory_slot *slot = owner;
> 	struct kvm_gfn_range gfn_range = {
> 		.slot	   = slot,
> 		.start	   = (start - slot->private_offset) >> PAGE_SHIFT,
> 		.end	   = (end - slot->private_offset) >> PAGE_SHIFT,
> 		.may_block = true,
> 	};
> 
> 	if (!has_overlap(slot, start, end))
> 		return 0;
> 
> 	gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> 
> 	kvm_unmap_gfn_range(slot->kvm, &gfn_range);
> 	return 0;
> }

I understand this KVM-side handling, but again one fd can have multiple
memslots. How does shmem decide which memslot to notify from a list of
memslots when it invokes notify_fallocate()? Or does it just notify all
the possible memslots and then let KVM check?

Thanks,
Chao
Sean Christopherson Jan. 4, 2022, 5:43 p.m. UTC | #6
On Fri, Dec 31, 2021, Chao Peng wrote:
> On Tue, Dec 28, 2021 at 09:48:08PM +0000, Sean Christopherson wrote:
> KVM handles
> > reverse engineering the memslot to get the offset and whatever else it needs.
> > notify_fallocate() and other callbacks are unchanged, though they probably can
> > drop the inode.
> > 
> > E.g. likely with bad math and handwaving on the overlap detection:
> > 
> > int kvm_private_fd_fallocate_range(void *owner, pgoff_t start, pgoff_t end)
> > {
> > 	struct kvm_memory_slot *slot = owner;
> > 	struct kvm_gfn_range gfn_range = {
> > 		.slot	   = slot,
> > 		.start	   = (start - slot->private_offset) >> PAGE_SHIFT,
> > 		.end	   = (end - slot->private_offset) >> PAGE_SHIFT,
> > 		.may_block = true,
> > 	};
> > 
> > 	if (!has_overlap(slot, start, end))
> > 		return 0;
> > 
> > 	gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> > 
> > 	kvm_unmap_gfn_range(slot->kvm, &gfn_range);
> > 	return 0;
> > }
> 
> I understand this KVM-side handling, but again one fd can have multiple
> memslots. How does shmem decide which memslot to notify from a list of
> memslots when it invokes notify_fallocate()? Or does it just notify all
> the possible memslots and then let KVM check?

Heh, yeah, those are the two choices.  :-)

Either the backing store needs to support registering callbacks for specific,
arbitrary ranges, or it needs to invoke all registered callbacks.  Invoking all
callbacks has my vote; it's much simpler to implement and is unlikely to incur
meaningful overhead.  _Something_ has to find the overlapping ranges, that cost
doesn't magically go away if it's pushed into the backing store.

Note, invoking all notifiers is also aligned with the mmu_notifier behavior.
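
A sketch of the invoke-everyone approach, reusing the hypothetical
memfd_notifier list above (names still invented):

static void memfd_notify_fallocate(struct memfd_backing_info *info,
				   loff_t start, loff_t end)
{
	struct memfd_notifier *n;

	/*
	 * Walk every registered notifier and let each callback do its own
	 * overlap check, mirroring how mmu_notifiers are invoked.
	 */
	mutex_lock(&info->lock);
	list_for_each_entry(n, &info->notifiers, list)
		n->ops->fallocate(n->owner, start, end);
	mutex_unlock(&info->lock);
}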
Chao Peng Jan. 5, 2022, 6:09 a.m. UTC | #7
On Tue, Jan 04, 2022 at 05:43:50PM +0000, Sean Christopherson wrote:
> On Fri, Dec 31, 2021, Chao Peng wrote:
> > On Tue, Dec 28, 2021 at 09:48:08PM +0000, Sean Christopherson wrote:
> KVM handles
> > > reverse engineering the memslot to get the offset and whatever else it needs.
> > > notify_fallocate() and other callbacks are unchanged, though they probably can
> > > drop the inode.
> > > 
> > > E.g. likely with bad math and handwaving on the overlap detection:
> > > 
> > > int kvm_private_fd_fallocate_range(void *owner, pgoff_t start, pgoff_t end)
> > > {
> > > 	struct kvm_memory_slot *slot = owner;
> > > 	struct kvm_gfn_range gfn_range = {
> > > 		.slot	   = slot,
> > > 		.start	   = (start - slot->private_offset) >> PAGE_SHIFT,
> > > 		.end	   = (end - slot->private_offset) >> PAGE_SHIFT,
> > > 		.may_block = true,
> > > 	};
> > > 
> > > 	if (!has_overlap(slot, start, end))
> > > 		return 0;
> > > 
> > > 	gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages);
> > > 
> > > 	kvm_unmap_gfn_range(slot->kvm, &gfn_range);
> > > 	return 0;
> > > }
> > 
> > I understand this KVM-side handling, but again one fd can have multiple
> > memslots. How does shmem decide which memslot to notify from a list of
> > memslots when it invokes notify_fallocate()? Or does it just notify all
> > the possible memslots and then let KVM check?
> 
> Heh, yeah, those are the two choices.  :-)
> 
> Either the backing store needs to support registering callbacks for specific,
> arbitrary ranges, or it needs to invoke all registered callbacks.  Invoking all
> callbacks has my vote; it's much simpler to implement and is unlikely to incur
> meaningful overhead.  _Something_ has to find the overlapping ranges, that cost
> doesn't magically go away if it's pushed into the backing store.
> 
> Note, invoking all notifiers is also aligned with the mmu_notifier behavior.

Sounds like a good reason. Then the shmem side only needs to maintain a
list of users.

Chao

Patch

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2cd35560c44b..3bd875f9669f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -451,6 +451,7 @@  static inline int kvm_vcpu_exiting_guest_mode(struct kvm_vcpu *vcpu)
 struct kvm_memory_slot {
 	struct hlist_node id_node[2];
 	struct interval_tree_node hva_node[2];
+	struct interval_tree_node ofs_node[2];
 	struct rb_node gfn_node[2];
 	gfn_t base_gfn;
 	unsigned long npages;
@@ -560,6 +561,7 @@  struct kvm_memslots {
 	u64 generation;
 	atomic_long_t last_used_slot;
 	struct rb_root_cached hva_tree;
+	struct rb_root_cached ofs_tree;
 	struct rb_root gfn_tree;
 	/*
 	 * The mapping table from slot id to memslot.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b0f7e6eb00ff..47e96d1eb233 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1087,6 +1087,7 @@  static struct kvm *kvm_create_vm(unsigned long type)
 
 			atomic_long_set(&slots->last_used_slot, (unsigned long)NULL);
 			slots->hva_tree = RB_ROOT_CACHED;
+			slots->ofs_tree = RB_ROOT_CACHED;
 			slots->gfn_tree = RB_ROOT;
 			hash_init(slots->id_hash);
 			slots->node_idx = j;
@@ -1363,7 +1364,7 @@  static void kvm_replace_gfn_node(struct kvm_memslots *slots,
  * With NULL @old this simply adds @new.
  * With NULL @new this simply removes @old.
  *
- * If @new is non-NULL its hva_node[slots_idx] range has to be set
+ * If @new is non-NULL its hva/ofs_node[slots_idx] range has to be set
  * appropriately.
  */
 static void kvm_replace_memslot(struct kvm *kvm,
@@ -1377,6 +1378,7 @@  static void kvm_replace_memslot(struct kvm *kvm,
 	if (old) {
 		hash_del(&old->id_node[idx]);
 		interval_tree_remove(&old->hva_node[idx], &slots->hva_tree);
+		interval_tree_remove(&old->ofs_node[idx], &slots->ofs_tree);
 
 		if ((long)old == atomic_long_read(&slots->last_used_slot))
 			atomic_long_set(&slots->last_used_slot, (long)new);
@@ -1388,20 +1390,27 @@  static void kvm_replace_memslot(struct kvm *kvm,
 	}
 
 	/*
-	 * Initialize @new's hva range.  Do this even when replacing an @old
+	 * Initialize @new's hva/ofs range.  Do this even when replacing an @old
 	 * slot, kvm_copy_memslot() deliberately does not touch node data.
 	 */
 	new->hva_node[idx].start = new->userspace_addr;
 	new->hva_node[idx].last = new->userspace_addr +
 				  (new->npages << PAGE_SHIFT) - 1;
+	if (kvm_slot_is_private(new)) {
+		new->ofs_node[idx].start = new->ofs;
+		new->ofs_node[idx].last = new->ofs +
+					  (new->npages << PAGE_SHIFT) - 1;
+	}
 
 	/*
 	 * (Re)Add the new memslot.  There is no O(1) interval_tree_replace(),
-	 * hva_node needs to be swapped with remove+insert even though hva can't
-	 * change when replacing an existing slot.
+	 * hva_node/ofs_node needs to be swapped with remove+insert even though
+	 * hva/ofs can't change when replacing an existing slot.
 	 */
 	hash_add(slots->id_hash, &new->id_node[idx], new->id);
 	interval_tree_insert(&new->hva_node[idx], &slots->hva_tree);
+	if (kvm_slot_is_private(new))
+		interval_tree_insert(&new->ofs_node[idx], &slots->ofs_tree);
 
 	/*
 	 * If the memslot gfn is unchanged, rb_replace_node() can be used to