
[v3,2/6] KVM: Add KVM_CREATE_COALESCED_MMIO_BUFFER ioctl

Message ID 20240820133333.1724191-3-ilstam@amazon.com (mailing list archive)
State New, archived
Series KVM: Improve MMIO Coalescing API

Commit Message

Ilias Stamatis Aug. 20, 2024, 1:33 p.m. UTC
The current MMIO coalescing design has a few drawbacks which limit its
usefulness. Currently all coalesced MMIO zones use the same ring buffer.
That means that upon a userspace exit we have to handle potentially
unrelated MMIO writes synchronously. And a VM-wide lock needs to be
taken in the kernel when an MMIO exit occurs.

Additionally, there is no direct way for userspace to be notified about
coalesced MMIO writes. If the next MMIO exit to userspace is when the
ring buffer has filled then a substantial (and unbounded) amount of time
may have passed since the first coalesced MMIO.

Add a KVM_CREATE_COALESCED_MMIO_BUFFER ioctl to KVM. This ioctl simply
returns a file descriptor to the caller but does not allocate a ring
buffer. Userspace can then pass this fd to mmap() to actually allocate a
buffer and map it to its address space.

Subsequent patches will allow userspace to:

- Associate the fd with a coalescing zone when registering it so that
  writes to that zone are accumulated in that specific ring buffer
  rather than the VM-wide one.
- Poll for MMIO writes using this fd.
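
For illustration, a minimal sketch of the intended userspace flow with this
series applied, assuming the one-page ring allocated by this patch's mmap
handler and the existing struct kvm_coalesced_mmio_ring layout from
<linux/kvm.h>. The vm_fd/buf_fd names and the drain loop are placeholders,
and memory barriers are omitted for brevity:

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create the fd (new ioctl from this patch) and mmap() the one-page ring. */
static struct kvm_coalesced_mmio_ring *map_coalesced_buffer(int vm_fd, int *buf_fd)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *ring;

	*buf_fd = ioctl(vm_fd, KVM_CREATE_COALESCED_MMIO_BUFFER);
	if (*buf_fd < 0)
		return NULL;

	ring = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    *buf_fd, 0);
	return ring == MAP_FAILED ? NULL : ring;
}

/* Drain entries the same way the existing VM-wide ring is drained today. */
static void drain_coalesced_buffer(struct kvm_coalesced_mmio_ring *ring)
{
	long page_size = sysconf(_SC_PAGESIZE);
	__u32 max = (page_size - sizeof(*ring)) /
		    sizeof(struct kvm_coalesced_mmio);

	while (ring->first != ring->last) {
		struct kvm_coalesced_mmio *ent = &ring->coalesced_mmio[ring->first];

		(void)ent; /* handle ent->phys_addr, ent->len, ent->data here */
		ring->first = (ring->first + 1) % max;
	}
}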

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>
---

v2->v3: 
  - Removed unnecessary brackets in a switch case and adjusted
    indentation on a spinlock as suggested by Sean Christopherson

 include/linux/kvm_host.h  |   1 +
 include/uapi/linux/kvm.h  |   2 +
 virt/kvm/coalesced_mmio.c | 141 +++++++++++++++++++++++++++++++++++---
 virt/kvm/coalesced_mmio.h |   9 +++
 virt/kvm/kvm_main.c       |   3 +
 5 files changed, 148 insertions(+), 8 deletions(-)

Comments

Sean Christopherson Aug. 28, 2024, 2:25 p.m. UTC | #1
On Tue, Aug 20, 2024, Ilias Stamatis wrote:
> The current MMIO coalescing design has a few drawbacks which limit its
> usefulness. Currently all coalesced MMIO zones use the same ring buffer.
> That means that upon a userspace exit we have to handle potentially
> unrelated MMIO writes synchronously. And a VM-wide lock needs to be
> taken in the kernel when an MMIO exit occurs.
> 
> Additionally, there is no direct way for userspace to be notified about
> coalesced MMIO writes. If the next MMIO exit to userspace is when the
> ring buffer has filled then a substantial (and unbounded) amount of time
> may have passed since the first coalesced MMIO.
> 
> Add a KVM_CREATE_COALESCED_MMIO_BUFFER ioctl to KVM. This ioctl simply

Why allocate the buffer in KVM?  It allows reusing coalesced_mmio_write() without
needing {get,put}_user() or any form of kmap(), but it adds complexity (quite a
bit in fact) to KVM and inherits many of the restrictions and warts of the existing
coalesced I/O implementation.

E.g. the size of the ring buffer is "fixed", yet variable based on the host kernel
page size.

Why not go with a BYOB model?  E.g. so that userspace can explicitly define the
buffer size to best fit the properties of the I/O range, and to avoid additional
complexity in KVM.
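
(One possible shape for such a BYOB registration, purely hypothetical -- neither
this struct nor any matching ioctl exists in KVM's UAPI. The point is only that
userspace would pick the buffer and its size, and KVM would then need
{get,put}_user()/kmap-style access on the write path rather than a
kernel-allocated page:)

#include <linux/types.h>

/* Hypothetical, illustration only: a userspace-owned ring handed to KVM at
 * zone registration time instead of a KVM-allocated page. */
struct kvm_coalesced_mmio_zone_byob {
	__u64 addr;		/* guest-physical start of the zone */
	__u32 size;		/* zone length in bytes */
	__u32 pio;
	__u64 ring_uaddr;	/* userspace address of the VMM-allocated ring */
	__u32 ring_entries;	/* capacity chosen by userspace, not PAGE_SIZE-derived */
	__u32 pad;
};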

> returns a file descriptor to the caller but does not allocate a ring
> buffer. Userspace can then pass this fd to mmap() to actually allocate a
> buffer and map it to its address space.
> 
> Subsequent patches will allow userspace to:
> 
> - Associate the fd with a coalescing zone when registering it so that
>   writes to that zone are accumulated in that specific ring buffer
>   rather than the VM-wide one.
> - Poll for MMIO writes using this fd.

Why?  I get the desire for a doorbell, but KVM already supports "fast" I/O for
doorbells.  The only use case I can think of is where the doorbell sits in the
same region as the data payload, but that essentially just saves an entry in
KVM's MMIO bus (or I suppose two entries if the doorbell is in the middle of
the region).
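
(For reference, the "fast" path meant here is KVM_IOEVENTFD, as confirmed
downthread. A minimal wildcard-doorbell sketch, with vm_fd and doorbell_gpa
assumed; note that it only signals the eventfd -- the value the guest wrote is
never delivered to userspace, which is the limitation raised in the reply
below:)

#include <linux/kvm.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

static int register_doorbell(int vm_fd, __u64 doorbell_gpa)
{
	struct kvm_ioeventfd ioev = {
		.addr  = doorbell_gpa,
		.len   = 4,
		.fd    = eventfd(0, EFD_CLOEXEC),
		.flags = 0,	/* wildcard: fire on any written value */
	};

	if (ioev.fd < 0)
		return -1;

	return ioctl(vm_fd, KVM_IOEVENTFD, &ioev);
}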

> 
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> Reviewed-by: Paul Durrant <paul@xen.org>
> ---

...


> +static void coalesced_mmio_buffer_vma_close(struct vm_area_struct *vma)
> +{
> +	struct kvm_coalesced_mmio_buffer_dev *dev = vma->vm_private_data;
> +
> +	spin_lock(&dev->ring_lock);
> +
> +	vfree(dev->ring);
> +	dev->ring = NULL;
> +
> +	spin_unlock(&dev->ring_lock);

I doubt this is correct.  I don't see VM_DONTCOPY, and so userspace can create a
second (or more) VMA by forking, and then KVM will hit a UAF if any of the VMAs
is closed.

> +}
> +
> +static const struct vm_operations_struct coalesced_mmio_buffer_vm_ops = {
> +	.close = coalesced_mmio_buffer_vma_close,
> +};
> +
> +static int coalesced_mmio_buffer_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct kvm_coalesced_mmio_buffer_dev *dev = file->private_data;
> +	unsigned long pfn;
> +	int ret = 0;
> +
> +	spin_lock(&dev->ring_lock);
> +
> +	if (dev->ring) {

Restricting userspace to a single mmap() is a very bizarre restriction.

> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}
Stamatis, Ilias Aug. 29, 2024, 11:20 a.m. UTC | #2
On Wed, 2024-08-28 at 07:25 -0700, Sean Christopherson wrote:
> On Tue, Aug 20, 2024, Ilias Stamatis wrote:
> > The current MMIO coalescing design has a few drawbacks which limit its
> > usefulness. Currently all coalesced MMIO zones use the same ring buffer.
> > That means that upon a userspace exit we have to handle potentially
> > unrelated MMIO writes synchronously. And a VM-wide lock needs to be
> > taken in the kernel when an MMIO exit occurs.
> > 
> > Additionally, there is no direct way for userspace to be notified about
> > coalesced MMIO writes. If the next MMIO exit to userspace is when the
> > ring buffer has filled then a substantial (and unbounded) amount of time
> > may have passed since the first coalesced MMIO.
> > 
> > Add a KVM_CREATE_COALESCED_MMIO_BUFFER ioctl to KVM. This ioctl simply
> 
> Why allocate the buffer in KVM?  It allows reusing coalesced_mmio_write() without
> needing {get,put}_user() or any form of kmap(), but it adds complexity (quite a
> bit in fact) to KVM and inherits many of the restrictions and warts of the existing
> coalesced I/O implementation.
> 
> E.g. the size of the ring buffer is "fixed", yet variable based on the host kernel
> page size.
> 
> Why not go with a BYOB model?  E.g. so that userspace can explicitly define the
> buffer size to best fit the properties of the I/O range, and to avoid additional
> complexity in KVM.

Fair enough. I used the original implementation as a guide, but let's
do it Right™ this time.

> 
> > returns a file descriptor to the caller but does not allocate a ring
> > buffer. Userspace can then pass this fd to mmap() to actually allocate a
> > buffer and map it to its address space.
> > 
> > Subsequent patches will allow userspace to:
> > 
> > - Associate the fd with a coalescing zone when registering it so that
> >   writes to that zone are accumulated in that specific ring buffer
> >   rather than the VM-wide one.
> > - Poll for MMIO writes using this fd.
> 
> Why?  I get the desire for a doorbell, but KVM already supports "fast" I/O for
> doorbells. 

What do you refer to here? ioeventfd? 

Correct me if I am wrong, but my understanding is that with an
ioeventfd the write value is not available. And that is a problem when
that value is a head or tail pointer. 

I also realised that I need to improve my commit messages.

>  The only use case I can think of is where the doorbell sits in the
> same region as the data payload, but that essentially just saves an entry in
> KVM's MMIO bus (or I suppose two entries if the doorbell is in the middle of
> the region).
> 
> > 
> > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > Reviewed-by: Paul Durrant <paul@xen.org>
> > ---
> 
> ...
> 
> 
> > +static void coalesced_mmio_buffer_vma_close(struct vm_area_struct *vma)
> > +{
> > +	struct kvm_coalesced_mmio_buffer_dev *dev = vma->vm_private_data;
> > +
> > +	spin_lock(&dev->ring_lock);
> > +
> > +	vfree(dev->ring);
> > +	dev->ring = NULL;
> > +
> > +	spin_unlock(&dev->ring_lock);
> 
> I doubt this is correct.  I don't see VM_DONTCOPY, and so userspace can create a
> second (or more) VMA by forking, and then KVM will hit a UAF if any of the VMAs
> is closed.

Oops. Will fix.

> 
> > +}
> > +
> > +static const struct vm_operations_struct coalesced_mmio_buffer_vm_ops = {
> > +	.close = coalesced_mmio_buffer_vma_close,
> > +};
> > +
> > +static int coalesced_mmio_buffer_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +	struct kvm_coalesced_mmio_buffer_dev *dev = file->private_data;
> > +	unsigned long pfn;
> > +	int ret = 0;
> > +
> > +	spin_lock(&dev->ring_lock);
> > +
> > +	if (dev->ring) {
> 
> Restricting userspace to a single mmap() is a very bizarre restriction.
> 

Hmm, you are right. Even if the buffer is allocated in the kernel and
one has already been allocated, we should let the user map it at
multiple addresses. I will fix that too.

> > +		ret = -EBUSY;
> > +		goto out_unlock;
> > +	}

Thanks for the review.

Ilias
Sean Christopherson Aug. 29, 2024, 2:55 p.m. UTC | #3
On Thu, Aug 29, 2024, Ilias Stamatis wrote:
> On Wed, 2024-08-28 at 07:25 -0700, Sean Christopherson wrote:
> > > returns a file descriptor to the caller but does not allocate a ring
> > > buffer. Userspace can then pass this fd to mmap() to actually allocate a
> > > buffer and map it to its address space.
> > > 
> > > Subsequent patches will allow userspace to:
> > > 
> > > - Associate the fd with a coalescing zone when registering it so that
> > >   writes to that zone are accumulated in that specific ring buffer
> > >   rather than the VM-wide one.
> > > - Poll for MMIO writes using this fd.
> > 
> > Why?  I get the desire for a doorbell, but KVM already supports "fast" I/O for
> > doorbells. 
> 
> What do you refer to here? ioeventfd? 

Ya.

> Correct me if I am wrong, but my understanding is that with an
> ioeventfd the write value is not available. And that is a problem when
> that value is a head or tail pointer. 

Ah.  Can you describe (or point at) an example device?  I don't read many device
specs (understatement).  It would be helpful to have a concrete use case for
reviewing the design itself.

In a perfect world, poll() support would come from a third party file type, as
this doesn't seem _that_ unique (to KVM).  But AFAICT there isn't an existing
type that is a good fit, probably because it's such a simple thing.
Stamatis, Ilias Aug. 29, 2024, 5:42 p.m. UTC | #4
On Thu, 2024-08-29 at 07:55 -0700, Sean Christopherson wrote:
> On Thu, Aug 29, 2024, Ilias Stamatis wrote:
> > On Wed, 2024-08-28 at 07:25 -0700, Sean Christopherson wrote:
> > > > returns a file descriptor to the caller but does not allocate a ring
> > > > buffer. Userspace can then pass this fd to mmap() to actually allocate a
> > > > buffer and map it to its address space.
> > > > 
> > > > Subsequent patches will allow userspace to:
> > > > 
> > > > - Associate the fd with a coalescing zone when registering it so that
> > > >   writes to that zone are accumulated in that specific ring buffer
> > > >   rather than the VM-wide one.
> > > > - Poll for MMIO writes using this fd.
> > > 
> > > Why?  I get the desire for a doorbell, but KVM already supports "fast" I/O for
> > > doorbells. 
> > 
> > What do you refer to here? ioeventfd? 
> 
> Ya.
> 
> > Correct me if I am wrong, but my understanding is that with an
> > ioeventfd the write value is not available. And that is a problem when
> > that value is a head or tail pointer. 
> 
> Ah.  Can you describe (or point at) an example device?  I don't read many device
> specs (understatement).  It would be helpful to have a concrete use case for
> reviewing the design itself.

The Intel 82599 VF is one such example, and I believe NVMe is another.

> 
> In a perfect world, poll() support would come from a third party file type, as
> this doesn't seem _that_ unique (to KVM).  But AFAICT there isn't an existing
> type that is a good fit, probably because it's such a simple thing.

Patch

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ed0520268de4..efb07422e76b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -808,6 +808,7 @@  struct kvm {
 	struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
 	spinlock_t ring_lock;
 	struct list_head coalesced_zones;
+	struct list_head coalesced_buffers;
 #endif
 
 	struct mutex irq_lock;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc055145..87f79a820fc0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1573,4 +1573,6 @@  struct kvm_pre_fault_memory {
 	__u64 padding[5];
 };
 
+#define KVM_CREATE_COALESCED_MMIO_BUFFER _IO(KVMIO,   0xd6)
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c
index 184c5c40c9c1..98b7e8760aa7 100644
--- a/virt/kvm/coalesced_mmio.c
+++ b/virt/kvm/coalesced_mmio.c
@@ -4,6 +4,7 @@ 
  *
  * Copyright (c) 2008 Bull S.A.S.
  * Copyright 2009 Red Hat, Inc. and/or its affiliates.
+ * Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  *
  *  Author: Laurent Vivier <Laurent.Vivier@bull.net>
  *
@@ -14,6 +15,7 @@ 
 #include <linux/kvm_host.h>
 #include <linux/slab.h>
 #include <linux/kvm.h>
+#include <linux/anon_inodes.h>
 
 #include "coalesced_mmio.h"
 
@@ -40,17 +42,14 @@  static int coalesced_mmio_in_range(struct kvm_coalesced_mmio_dev *dev,
 	return 1;
 }
 
-static int coalesced_mmio_has_room(struct kvm_coalesced_mmio_dev *dev, u32 last)
+static int coalesced_mmio_has_room(struct kvm_coalesced_mmio_ring *ring, u32 last)
 {
-	struct kvm_coalesced_mmio_ring *ring;
-
 	/* Are we able to batch it ? */
 
 	/* last is the first free entry
 	 * check if we don't meet the first used entry
 	 * there is always one unused entry in the buffer
 	 */
-	ring = dev->kvm->coalesced_mmio_ring;
 	if ((last + 1) % KVM_COALESCED_MMIO_MAX == READ_ONCE(ring->first)) {
 		/* full */
 		return 0;
@@ -65,17 +64,27 @@  static int coalesced_mmio_write(struct kvm_vcpu *vcpu,
 {
 	struct kvm_coalesced_mmio_dev *dev = to_mmio(this);
 	struct kvm_coalesced_mmio_ring *ring = dev->kvm->coalesced_mmio_ring;
+	spinlock_t *lock = dev->buffer_dev ? &dev->buffer_dev->ring_lock :
+					     &dev->kvm->ring_lock;
 	__u32 insert;
 
 	if (!coalesced_mmio_in_range(dev, addr, len))
 		return -EOPNOTSUPP;
 
-	spin_lock(&dev->kvm->ring_lock);
+	spin_lock(lock);
+
+	if (dev->buffer_dev) {
+		ring = dev->buffer_dev->ring;
+		if (!ring) {
+			spin_unlock(lock);
+			return -EOPNOTSUPP;
+		}
+	}
 
 	insert = READ_ONCE(ring->last);
-	if (!coalesced_mmio_has_room(dev, insert) ||
+	if (!coalesced_mmio_has_room(ring, insert) ||
 	    insert >= KVM_COALESCED_MMIO_MAX) {
-		spin_unlock(&dev->kvm->ring_lock);
+		spin_unlock(lock);
 		return -EOPNOTSUPP;
 	}
 
@@ -87,7 +96,7 @@  static int coalesced_mmio_write(struct kvm_vcpu *vcpu,
 	ring->coalesced_mmio[insert].pio = dev->zone.pio;
 	smp_wmb();
 	ring->last = (insert + 1) % KVM_COALESCED_MMIO_MAX;
-	spin_unlock(&dev->kvm->ring_lock);
+	spin_unlock(lock);
 	return 0;
 }
 
@@ -122,6 +131,7 @@  int kvm_coalesced_mmio_init(struct kvm *kvm)
 	 */
 	spin_lock_init(&kvm->ring_lock);
 	INIT_LIST_HEAD(&kvm->coalesced_zones);
+	INIT_LIST_HEAD(&kvm->coalesced_buffers);
 
 	return 0;
 }
@@ -132,11 +142,125 @@  void kvm_coalesced_mmio_free(struct kvm *kvm)
 		free_page((unsigned long)kvm->coalesced_mmio_ring);
 }
 
+static void coalesced_mmio_buffer_vma_close(struct vm_area_struct *vma)
+{
+	struct kvm_coalesced_mmio_buffer_dev *dev = vma->vm_private_data;
+
+	spin_lock(&dev->ring_lock);
+
+	vfree(dev->ring);
+	dev->ring = NULL;
+
+	spin_unlock(&dev->ring_lock);
+}
+
+static const struct vm_operations_struct coalesced_mmio_buffer_vm_ops = {
+	.close = coalesced_mmio_buffer_vma_close,
+};
+
+static int coalesced_mmio_buffer_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct kvm_coalesced_mmio_buffer_dev *dev = file->private_data;
+	unsigned long pfn;
+	int ret = 0;
+
+	spin_lock(&dev->ring_lock);
+
+	if (dev->ring) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	dev->ring = vmalloc_user(PAGE_SIZE);
+	if (!dev->ring) {
+		ret = -ENOMEM;
+		goto out_unlock;
+	}
+
+	pfn = vmalloc_to_pfn(dev->ring);
+
+	if (remap_pfn_range(vma, vma->vm_start, pfn, PAGE_SIZE,
+			    vma->vm_page_prot)) {
+		vfree(dev->ring);
+		dev->ring = NULL;
+		ret = -EAGAIN;
+		goto out_unlock;
+	}
+
+	vma->vm_ops = &coalesced_mmio_buffer_vm_ops;
+	vma->vm_private_data = dev;
+
+out_unlock:
+	spin_unlock(&dev->ring_lock);
+
+	return ret;
+}
+
+static int coalesced_mmio_buffer_release(struct inode *inode, struct file *file)
+{
+
+	struct kvm_coalesced_mmio_buffer_dev *buffer_dev = file->private_data;
+	struct kvm_coalesced_mmio_dev *mmio_dev, *tmp;
+	struct kvm *kvm = buffer_dev->kvm;
+
+	/* Deregister all zones associated with this ring buffer */
+	mutex_lock(&kvm->slots_lock);
+
+	list_for_each_entry_safe(mmio_dev, tmp, &kvm->coalesced_zones, list) {
+		if (mmio_dev->buffer_dev == buffer_dev) {
+			if (kvm_io_bus_unregister_dev(kvm,
+			    mmio_dev->zone.pio ? KVM_PIO_BUS : KVM_MMIO_BUS,
+			    &mmio_dev->dev))
+				break;
+		}
+	}
+
+	list_del(&buffer_dev->list);
+	kfree(buffer_dev);
+
+	mutex_unlock(&kvm->slots_lock);
+
+	return 0;
+}
+
+static const struct file_operations coalesced_mmio_buffer_ops = {
+	.mmap = coalesced_mmio_buffer_mmap,
+	.release = coalesced_mmio_buffer_release,
+};
+
+int kvm_vm_ioctl_create_coalesced_mmio_buffer(struct kvm *kvm)
+{
+	int ret;
+	struct kvm_coalesced_mmio_buffer_dev *dev;
+
+	dev = kzalloc(sizeof(struct kvm_coalesced_mmio_buffer_dev),
+		      GFP_KERNEL_ACCOUNT);
+	if (!dev)
+		return -ENOMEM;
+
+	dev->kvm = kvm;
+	spin_lock_init(&dev->ring_lock);
+
+	ret = anon_inode_getfd("coalesced_mmio_buf", &coalesced_mmio_buffer_ops,
+			       dev, O_RDWR | O_CLOEXEC);
+	if (ret < 0) {
+		kfree(dev);
+		return ret;
+	}
+
+	mutex_lock(&kvm->slots_lock);
+	list_add_tail(&dev->list, &kvm->coalesced_buffers);
+	mutex_unlock(&kvm->slots_lock);
+
+	return ret;
+}
+
 int kvm_vm_ioctl_register_coalesced_mmio(struct kvm *kvm,
 					 struct kvm_coalesced_mmio_zone *zone)
 {
 	int ret;
 	struct kvm_coalesced_mmio_dev *dev;
+	struct kvm_coalesced_mmio_buffer_dev *buffer_dev = NULL;
 
 	if (zone->pio != 1 && zone->pio != 0)
 		return -EINVAL;
@@ -149,6 +273,7 @@  int kvm_vm_ioctl_register_coalesced_mmio(struct kvm *kvm,
 	kvm_iodevice_init(&dev->dev, &coalesced_mmio_ops);
 	dev->kvm = kvm;
 	dev->zone = *zone;
+	dev->buffer_dev = buffer_dev;
 
 	mutex_lock(&kvm->slots_lock);
 	ret = kvm_io_bus_register_dev(kvm,
diff --git a/virt/kvm/coalesced_mmio.h b/virt/kvm/coalesced_mmio.h
index 36f84264ed25..37d9d8f325bb 100644
--- a/virt/kvm/coalesced_mmio.h
+++ b/virt/kvm/coalesced_mmio.h
@@ -20,6 +20,14 @@  struct kvm_coalesced_mmio_dev {
 	struct kvm_io_device dev;
 	struct kvm *kvm;
 	struct kvm_coalesced_mmio_zone zone;
+	struct kvm_coalesced_mmio_buffer_dev *buffer_dev;
+};
+
+struct kvm_coalesced_mmio_buffer_dev {
+	struct list_head list;
+	struct kvm *kvm;
+	spinlock_t ring_lock;
+	struct kvm_coalesced_mmio_ring *ring;
 };
 
 int kvm_coalesced_mmio_init(struct kvm *kvm);
@@ -28,6 +36,7 @@  int kvm_vm_ioctl_register_coalesced_mmio(struct kvm *kvm,
 					struct kvm_coalesced_mmio_zone *zone);
 int kvm_vm_ioctl_unregister_coalesced_mmio(struct kvm *kvm,
 					struct kvm_coalesced_mmio_zone *zone);
+int kvm_vm_ioctl_create_coalesced_mmio_buffer(struct kvm *kvm);
 
 #else
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 238940c3cb32..9f6ad6e03317 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5244,6 +5244,9 @@  static long kvm_vm_ioctl(struct file *filp,
 		r = kvm_vm_ioctl_unregister_coalesced_mmio(kvm, &zone);
 		break;
 	}
+	case KVM_CREATE_COALESCED_MMIO_BUFFER:
+		r = kvm_vm_ioctl_create_coalesced_mmio_buffer(kvm);
+		break;
 #endif
 	case KVM_IRQFD: {
 		struct kvm_irqfd data;