| Message ID | 20240820133333.1724191-3-ilstam@amazon.com (mailing list archive) |
| --- | --- |
| State | New, archived |
| Series | KVM: Improve MMIO Coalescing API |
On Tue, Aug 20, 2024, Ilias Stamatis wrote:
> The current MMIO coalescing design has a few drawbacks which limit its
> usefulness. Currently all coalesced MMIO zones use the same ring buffer.
> That means that upon a userspace exit we have to handle potentially
> unrelated MMIO writes synchronously. And a VM-wide lock needs to be
> taken in the kernel when an MMIO exit occurs.
>
> Additionally, there is no direct way for userspace to be notified about
> coalesced MMIO writes. If the next MMIO exit to userspace is when the
> ring buffer has filled then a substantial (and unbounded) amount of time
> may have passed since the first coalesced MMIO.
>
> Add a KVM_CREATE_COALESCED_MMIO_BUFFER ioctl to KVM. This ioctl simply

Why allocate the buffer in KVM? It allows reusing coalesced_mmio_write() without
needing {get,put}_user() or any form of kmap(), but it adds complexity (quite a
bit in fact) to KVM and inherits many of the restrictions and warts of the
existing coalesced I/O implementation.

E.g. the size of the ring buffer is "fixed", yet variable based on the host
kernel page size.

Why not go with a BYOB model? E.g. so that userspace can explicitly define the
buffer size to best fit the properties of the I/O range, and to avoid additional
complexity in KVM.

> returns a file descriptor to the caller but does not allocate a ring
> buffer. Userspace can then pass this fd to mmap() to actually allocate a
> buffer and map it to its address space.
>
> Subsequent patches will allow userspace to:
>
> - Associate the fd with a coalescing zone when registering it so that
>   writes to that zone are accumulated in that specific ring buffer
>   rather than the VM-wide one.
> - Poll for MMIO writes using this fd.

Why? I get the desire for a doorbell, but KVM already supports "fast" I/O for
doorbells. The only use case I can think of is where the doorbell sits in the
same region as the data payload, but that essentially just saves an entry in
KVM's MMIO bus (or I suppose two entries if the doorbell is in the middle of
the region).

>
> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> Reviewed-by: Paul Durrant <paul@xen.org>
> ---

...

> +static void coalesced_mmio_buffer_vma_close(struct vm_area_struct *vma)
> +{
> +        struct kvm_coalesced_mmio_buffer_dev *dev = vma->vm_private_data;
> +
> +        spin_lock(&dev->ring_lock);
> +
> +        vfree(dev->ring);
> +        dev->ring = NULL;
> +
> +        spin_unlock(&dev->ring_lock);

I doubt this is correct. I don't see VM_DONTCOPY, and so userspace can create a
second (or more) VMA by forking, and then KVM will hit a UAF if any of the VMAs
is closed.

> +}
> +
> +static const struct vm_operations_struct coalesced_mmio_buffer_vm_ops = {
> +        .close = coalesced_mmio_buffer_vma_close,
> +};
> +
> +static int coalesced_mmio_buffer_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +        struct kvm_coalesced_mmio_buffer_dev *dev = file->private_data;
> +        unsigned long pfn;
> +        int ret = 0;
> +
> +        spin_lock(&dev->ring_lock);
> +
> +        if (dev->ring) {

Restricting userspace to a single mmap() is a very bizarre restriction.

> +                ret = -EBUSY;
> +                goto out_unlock;
> +        }
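For context on the VM_DONTCOPY remark above, a minimal sketch (not part of the patch) of how the ->mmap handler could mark the mapping so that fork() never duplicates the VMA; vm_flags_set(), VM_DONTCOPY and VM_DONTEXPAND are the stock kernel facilities, and the elided body stands in for the patch's allocation and remap_pfn_range() logic:

```c
/*
 * Sketch only: mark the coalesced-MMIO buffer mapping as non-copyable so a
 * child created by fork() never inherits a VMA whose ->close handler would
 * free the ring while the parent is still using it.
 */
static int coalesced_mmio_buffer_mmap(struct file *file,
				      struct vm_area_struct *vma)
{
	struct kvm_coalesced_mmio_buffer_dev *dev = file->private_data;

	/* ... allocate dev->ring and remap it into the VMA as in the patch ... */

	vm_flags_set(vma, VM_DONTCOPY | VM_DONTEXPAND | VM_DONTDUMP);
	vma->vm_ops = &coalesced_mmio_buffer_vm_ops;
	vma->vm_private_data = dev;

	return 0;
}
```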
On Wed, 2024-08-28 at 07:25 -0700, Sean Christopherson wrote:
> On Tue, Aug 20, 2024, Ilias Stamatis wrote:
> > The current MMIO coalescing design has a few drawbacks which limit its
> > usefulness. Currently all coalesced MMIO zones use the same ring buffer.
> > That means that upon a userspace exit we have to handle potentially
> > unrelated MMIO writes synchronously. And a VM-wide lock needs to be
> > taken in the kernel when an MMIO exit occurs.
> >
> > Additionally, there is no direct way for userspace to be notified about
> > coalesced MMIO writes. If the next MMIO exit to userspace is when the
> > ring buffer has filled then a substantial (and unbounded) amount of time
> > may have passed since the first coalesced MMIO.
> >
> > Add a KVM_CREATE_COALESCED_MMIO_BUFFER ioctl to KVM. This ioctl simply
>
> Why allocate the buffer in KVM? It allows reusing coalesced_mmio_write() without
> needing {get,put}_user() or any form of kmap(), but it adds complexity (quite a
> bit in fact) to KVM and inherits many of the restrictions and warts of the
> existing coalesced I/O implementation.
>
> E.g. the size of the ring buffer is "fixed", yet variable based on the host
> kernel page size.
>
> Why not go with a BYOB model? E.g. so that userspace can explicitly define the
> buffer size to best fit the properties of the I/O range, and to avoid additional
> complexity in KVM.

Fair enough. I used the original implementation as a guide, but let's do it
Right™ this time.

>
> > returns a file descriptor to the caller but does not allocate a ring
> > buffer. Userspace can then pass this fd to mmap() to actually allocate a
> > buffer and map it to its address space.
> >
> > Subsequent patches will allow userspace to:
> >
> > - Associate the fd with a coalescing zone when registering it so that
> >   writes to that zone are accumulated in that specific ring buffer
> >   rather than the VM-wide one.
> > - Poll for MMIO writes using this fd.
>
> Why? I get the desire for a doorbell, but KVM already supports "fast" I/O for
> doorbells.

What do you refer to here? ioeventfd?

Correct me if I am wrong, but my understanding is that with an ioeventfd the
write value is not available. And that is a problem when that value is a head
or tail pointer.

I also realised that I need to improve my commit messages.

> The only use case I can think of is where the doorbell sits in the
> same region as the data payload, but that essentially just saves an entry in
> KVM's MMIO bus (or I suppose two entries if the doorbell is in the middle of
> the region).
>
> >
> > Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
> > Reviewed-by: Paul Durrant <paul@xen.org>
> > ---
>
> ...
>
> > +static void coalesced_mmio_buffer_vma_close(struct vm_area_struct *vma)
> > +{
> > +        struct kvm_coalesced_mmio_buffer_dev *dev = vma->vm_private_data;
> > +
> > +        spin_lock(&dev->ring_lock);
> > +
> > +        vfree(dev->ring);
> > +        dev->ring = NULL;
> > +
> > +        spin_unlock(&dev->ring_lock);
>
> I doubt this is correct. I don't see VM_DONTCOPY, and so userspace can create a
> second (or more) VMA by forking, and then KVM will hit a UAF if any of the VMAs
> is closed.

Oops. Will fix.

> > +}
> > +
> > +static const struct vm_operations_struct coalesced_mmio_buffer_vm_ops = {
> > +        .close = coalesced_mmio_buffer_vma_close,
> > +};
> > +
> > +static int coalesced_mmio_buffer_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +        struct kvm_coalesced_mmio_buffer_dev *dev = file->private_data;
> > +        unsigned long pfn;
> > +        int ret = 0;
> > +
> > +        spin_lock(&dev->ring_lock);
> > +
> > +        if (dev->ring) {
>
> Restricting userspace to a single mmap() is a very bizarre restriction.
>

Hmm, you are right. Even if the buffer is allocated in the kernel, and one was
allocated already, we should let the user map it to multiple addresses. I will
fix that too.

> > +                ret = -EBUSY;
> > +                goto out_unlock;
> > +        }

Thanks for the review.

Ilias
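For readers unfamiliar with the ioeventfd limitation being discussed, a minimal userspace sketch of the existing KVM_IOEVENTFD mechanism (register_doorbell() is a hypothetical helper): the kernel only signals the eventfd when the guest writes the region, and the written value itself is never surfaced to userspace, which is exactly the problem when the value is a head or tail pointer.

```c
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: register an eventfd doorbell for a 4-byte MMIO
 * register at 'gpa' on an existing KVM VM fd. Without
 * KVM_IOEVENTFD_FLAG_DATAMATCH, any value written signals the eventfd,
 * but the value itself is discarded by the kernel.
 */
static int register_doorbell(int vm_fd, __u64 gpa)
{
	struct kvm_ioeventfd ioefd = {
		.addr = gpa,
		.len = 4,
		.fd = eventfd(0, EFD_CLOEXEC),
	};

	if (ioefd.fd < 0 || ioctl(vm_fd, KVM_IOEVENTFD, &ioefd))
		return -1;

	return ioefd.fd;
}
```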
On Thu, Aug 29, 2024, Ilias Stamatis wrote:
> On Wed, 2024-08-28 at 07:25 -0700, Sean Christopherson wrote:
> > > returns a file descriptor to the caller but does not allocate a ring
> > > buffer. Userspace can then pass this fd to mmap() to actually allocate a
> > > buffer and map it to its address space.
> > >
> > > Subsequent patches will allow userspace to:
> > >
> > > - Associate the fd with a coalescing zone when registering it so that
> > >   writes to that zone are accumulated in that specific ring buffer
> > >   rather than the VM-wide one.
> > > - Poll for MMIO writes using this fd.
> >
> > Why? I get the desire for a doorbell, but KVM already supports "fast" I/O for
> > doorbells.
>
> What do you refer to here? ioeventfd?

Ya.

> Correct me if I am wrong, but my understanding is that with an
> ioeventfd the write value is not available. And that is a problem when
> that value is a head or tail pointer.

Ah. Can you describe (or point at) an example device? I don't read many device
specs (understatement). It would be helpful to have a concrete use case for
reviewing the design itself.

In a perfect world, poll() support would come from a third party file type, as
this doesn't seem _that_ unique (to KVM). But AFAICT there isn't an existing
type that is a good fit, probably because it's such a simple thing.
On Thu, 2024-08-29 at 07:55 -0700, Sean Christopherson wrote:
> On Thu, Aug 29, 2024, Ilias Stamatis wrote:
> > On Wed, 2024-08-28 at 07:25 -0700, Sean Christopherson wrote:
> > > > returns a file descriptor to the caller but does not allocate a ring
> > > > buffer. Userspace can then pass this fd to mmap() to actually allocate a
> > > > buffer and map it to its address space.
> > > >
> > > > Subsequent patches will allow userspace to:
> > > >
> > > > - Associate the fd with a coalescing zone when registering it so that
> > > >   writes to that zone are accumulated in that specific ring buffer
> > > >   rather than the VM-wide one.
> > > > - Poll for MMIO writes using this fd.
> > >
> > > Why? I get the desire for a doorbell, but KVM already supports "fast" I/O for
> > > doorbells.
> >
> > What do you refer to here? ioeventfd?
>
> Ya.
>
> > Correct me if I am wrong, but my understanding is that with an
> > ioeventfd the write value is not available. And that is a problem when
> > that value is a head or tail pointer.
>
> Ah. Can you describe (or point at) an example device? I don't read many device
> specs (understatement). It would be helpful to have a concrete use case for
> reviewing the design itself.

Intel 82599 VF is one such example. I believe NVMe is another example.

>
> In a perfect world, poll() support would come from a third party file type, as
> this doesn't seem _that_ unique (to KVM). But AFAICT there isn't an existing
> type that is a good fit, probably because it's such a simple thing.
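To make the NVMe example concrete (an illustration based on the NVMe base specification, not on anything stated in the thread): the guest writes the new submission queue tail index to the doorbell register, so a VMM emulating the controller needs the written value, not just a notification that a write happened.

```c
#include <stdint.h>

/*
 * NVMe BAR0 doorbell layout: the SQ y tail doorbell sits at
 * 0x1000 + (2y) * (4 << CAP.DSTRD) and the CQ y head doorbell at
 * 0x1000 + (2y + 1) * (4 << CAP.DSTRD). The guest writes the new tail/head
 * index there; an eventfd kick alone would lose that index.
 */
static inline uint64_t nvme_sq_tail_doorbell(uint32_t qid, uint32_t dstrd)
{
	return 0x1000ULL + (2ULL * qid) * (4ULL << dstrd);
}
```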
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ed0520268de4..efb07422e76b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -808,6 +808,7 @@ struct kvm {
         struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
         spinlock_t ring_lock;
         struct list_head coalesced_zones;
+        struct list_head coalesced_buffers;
 #endif
 
         struct mutex irq_lock;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc055145..87f79a820fc0 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1573,4 +1573,6 @@ struct kvm_pre_fault_memory {
         __u64 padding[5];
 };
 
+#define KVM_CREATE_COALESCED_MMIO_BUFFER _IO(KVMIO, 0xd6)
+
 #endif /* __LINUX_KVM_H */
diff --git a/virt/kvm/coalesced_mmio.c b/virt/kvm/coalesced_mmio.c
index 184c5c40c9c1..98b7e8760aa7 100644
--- a/virt/kvm/coalesced_mmio.c
+++ b/virt/kvm/coalesced_mmio.c
@@ -4,6 +4,7 @@
  *
  * Copyright (c) 2008 Bull S.A.S.
  * Copyright 2009 Red Hat, Inc. and/or its affiliates.
+ * Copyright 2024 Amazon.com, Inc. or its affiliates. All Rights Reserved.
  *
  * Author: Laurent Vivier <Laurent.Vivier@bull.net>
  *
@@ -14,6 +15,7 @@
 #include <linux/kvm_host.h>
 #include <linux/slab.h>
 #include <linux/kvm.h>
+#include <linux/anon_inodes.h>
 
 #include "coalesced_mmio.h"
 
@@ -40,17 +42,14 @@ static int coalesced_mmio_in_range(struct kvm_coalesced_mmio_dev *dev,
         return 1;
 }
 
-static int coalesced_mmio_has_room(struct kvm_coalesced_mmio_dev *dev, u32 last)
+static int coalesced_mmio_has_room(struct kvm_coalesced_mmio_ring *ring, u32 last)
 {
-        struct kvm_coalesced_mmio_ring *ring;
-
         /* Are we able to batch it ? */
 
         /* last is the first free entry
          * check if we don't meet the first used entry
          * there is always one unused entry in the buffer
          */
-        ring = dev->kvm->coalesced_mmio_ring;
         if ((last + 1) % KVM_COALESCED_MMIO_MAX == READ_ONCE(ring->first)) {
                 /* full */
                 return 0;
@@ -65,17 +64,27 @@ static int coalesced_mmio_write(struct kvm_vcpu *vcpu,
 {
         struct kvm_coalesced_mmio_dev *dev = to_mmio(this);
         struct kvm_coalesced_mmio_ring *ring = dev->kvm->coalesced_mmio_ring;
+        spinlock_t *lock = dev->buffer_dev ?
+                           &dev->buffer_dev->ring_lock :
+                           &dev->kvm->ring_lock;
         __u32 insert;
 
         if (!coalesced_mmio_in_range(dev, addr, len))
                 return -EOPNOTSUPP;
 
-        spin_lock(&dev->kvm->ring_lock);
+        spin_lock(lock);
+
+        if (dev->buffer_dev) {
+                ring = dev->buffer_dev->ring;
+                if (!ring) {
+                        spin_unlock(lock);
+                        return -EOPNOTSUPP;
+                }
+        }
 
         insert = READ_ONCE(ring->last);
-        if (!coalesced_mmio_has_room(dev, insert) ||
+        if (!coalesced_mmio_has_room(ring, insert) ||
             insert >= KVM_COALESCED_MMIO_MAX) {
-                spin_unlock(&dev->kvm->ring_lock);
+                spin_unlock(lock);
                 return -EOPNOTSUPP;
         }
 
@@ -87,7 +96,7 @@ static int coalesced_mmio_write(struct kvm_vcpu *vcpu,
         ring->coalesced_mmio[insert].pio = dev->zone.pio;
         smp_wmb();
         ring->last = (insert + 1) % KVM_COALESCED_MMIO_MAX;
-        spin_unlock(&dev->kvm->ring_lock);
+        spin_unlock(lock);
 
         return 0;
 }
@@ -122,6 +131,7 @@ int kvm_coalesced_mmio_init(struct kvm *kvm)
          */
         spin_lock_init(&kvm->ring_lock);
         INIT_LIST_HEAD(&kvm->coalesced_zones);
+        INIT_LIST_HEAD(&kvm->coalesced_buffers);
 
         return 0;
 }
@@ -132,11 +142,125 @@ void kvm_coalesced_mmio_free(struct kvm *kvm)
         free_page((unsigned long)kvm->coalesced_mmio_ring);
 }
 
+static void coalesced_mmio_buffer_vma_close(struct vm_area_struct *vma)
+{
+        struct kvm_coalesced_mmio_buffer_dev *dev = vma->vm_private_data;
+
+        spin_lock(&dev->ring_lock);
+
+        vfree(dev->ring);
+        dev->ring = NULL;
+
+        spin_unlock(&dev->ring_lock);
+}
+
+static const struct vm_operations_struct coalesced_mmio_buffer_vm_ops = {
+        .close = coalesced_mmio_buffer_vma_close,
+};
+
+static int coalesced_mmio_buffer_mmap(struct file *file, struct vm_area_struct *vma)
+{
+        struct kvm_coalesced_mmio_buffer_dev *dev = file->private_data;
+        unsigned long pfn;
+        int ret = 0;
+
+        spin_lock(&dev->ring_lock);
+
+        if (dev->ring) {
+                ret = -EBUSY;
+                goto out_unlock;
+        }
+
+        dev->ring = vmalloc_user(PAGE_SIZE);
+        if (!dev->ring) {
+                ret = -ENOMEM;
+                goto out_unlock;
+        }
+
+        pfn = vmalloc_to_pfn(dev->ring);
+
+        if (remap_pfn_range(vma, vma->vm_start, pfn, PAGE_SIZE,
+                            vma->vm_page_prot)) {
+                vfree(dev->ring);
+                dev->ring = NULL;
+                ret = -EAGAIN;
+                goto out_unlock;
+        }
+
+        vma->vm_ops = &coalesced_mmio_buffer_vm_ops;
+        vma->vm_private_data = dev;
+
+out_unlock:
+        spin_unlock(&dev->ring_lock);
+
+        return ret;
+}
+
+static int coalesced_mmio_buffer_release(struct inode *inode, struct file *file)
+{
+
+        struct kvm_coalesced_mmio_buffer_dev *buffer_dev = file->private_data;
+        struct kvm_coalesced_mmio_dev *mmio_dev, *tmp;
+        struct kvm *kvm = buffer_dev->kvm;
+
+        /* Deregister all zones associated with this ring buffer */
+        mutex_lock(&kvm->slots_lock);
+
+        list_for_each_entry_safe(mmio_dev, tmp, &kvm->coalesced_zones, list) {
+                if (mmio_dev->buffer_dev == buffer_dev) {
+                        if (kvm_io_bus_unregister_dev(kvm,
+                                                      mmio_dev->zone.pio ? KVM_PIO_BUS : KVM_MMIO_BUS,
+                                                      &mmio_dev->dev))
+                                break;
+                }
+        }
+
+        list_del(&buffer_dev->list);
+        kfree(buffer_dev);
+
+        mutex_unlock(&kvm->slots_lock);
+
+        return 0;
+}
+
+static const struct file_operations coalesced_mmio_buffer_ops = {
+        .mmap = coalesced_mmio_buffer_mmap,
+        .release = coalesced_mmio_buffer_release,
+};
+
+int kvm_vm_ioctl_create_coalesced_mmio_buffer(struct kvm *kvm)
+{
+        int ret;
+        struct kvm_coalesced_mmio_buffer_dev *dev;
+
+        dev = kzalloc(sizeof(struct kvm_coalesced_mmio_buffer_dev),
+                      GFP_KERNEL_ACCOUNT);
+        if (!dev)
+                return -ENOMEM;
+
+        dev->kvm = kvm;
+        spin_lock_init(&dev->ring_lock);
+
+        ret = anon_inode_getfd("coalesced_mmio_buf", &coalesced_mmio_buffer_ops,
+                               dev, O_RDWR | O_CLOEXEC);
+        if (ret < 0) {
+                kfree(dev);
+                return ret;
+        }
+
+        mutex_lock(&kvm->slots_lock);
+        list_add_tail(&dev->list, &kvm->coalesced_buffers);
+        mutex_unlock(&kvm->slots_lock);
+
+        return ret;
+}
+
 int kvm_vm_ioctl_register_coalesced_mmio(struct kvm *kvm,
                                          struct kvm_coalesced_mmio_zone *zone)
 {
         int ret;
         struct kvm_coalesced_mmio_dev *dev;
+        struct kvm_coalesced_mmio_buffer_dev *buffer_dev = NULL;
 
         if (zone->pio != 1 && zone->pio != 0)
                 return -EINVAL;
@@ -149,6 +273,7 @@ int kvm_vm_ioctl_register_coalesced_mmio(struct kvm *kvm,
         kvm_iodevice_init(&dev->dev, &coalesced_mmio_ops);
         dev->kvm = kvm;
         dev->zone = *zone;
+        dev->buffer_dev = buffer_dev;
 
         mutex_lock(&kvm->slots_lock);
         ret = kvm_io_bus_register_dev(kvm,
diff --git a/virt/kvm/coalesced_mmio.h b/virt/kvm/coalesced_mmio.h
index 36f84264ed25..37d9d8f325bb 100644
--- a/virt/kvm/coalesced_mmio.h
+++ b/virt/kvm/coalesced_mmio.h
@@ -20,6 +20,14 @@ struct kvm_coalesced_mmio_dev {
         struct kvm_io_device dev;
         struct kvm *kvm;
         struct kvm_coalesced_mmio_zone zone;
+        struct kvm_coalesced_mmio_buffer_dev *buffer_dev;
+};
+
+struct kvm_coalesced_mmio_buffer_dev {
+        struct list_head list;
+        struct kvm *kvm;
+        spinlock_t ring_lock;
+        struct kvm_coalesced_mmio_ring *ring;
 };
 
 int kvm_coalesced_mmio_init(struct kvm *kvm);
@@ -28,6 +36,7 @@ int kvm_vm_ioctl_register_coalesced_mmio(struct kvm *kvm,
                                          struct kvm_coalesced_mmio_zone *zone);
 int kvm_vm_ioctl_unregister_coalesced_mmio(struct kvm *kvm,
                                            struct kvm_coalesced_mmio_zone *zone);
+int kvm_vm_ioctl_create_coalesced_mmio_buffer(struct kvm *kvm);
 
 #else
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 238940c3cb32..9f6ad6e03317 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5244,6 +5244,9 @@ static long kvm_vm_ioctl(struct file *filp,
                 r = kvm_vm_ioctl_unregister_coalesced_mmio(kvm, &zone);
                 break;
         }
+        case KVM_CREATE_COALESCED_MMIO_BUFFER:
+                r = kvm_vm_ioctl_create_coalesced_mmio_buffer(kvm);
+                break;
 #endif
         case KVM_IRQFD: {
                 struct kvm_irqfd data;
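For orientation, a minimal userspace sketch of how the new ioctl composes with mmap() and the existing coalesced-MMIO ring layout; the consumer protocol mirrors the current KVM_CAP_COALESCED_MMIO ring (read entries in [first, last), then advance first). Zone association and poll() support land in later patches of the series and are not shown; drain_coalesced_buffer() is a hypothetical helper, and the fallback #define simply repeats the value added by this patch for headers that lack it.

```c
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kvm.h>

#ifndef KVM_CREATE_COALESCED_MMIO_BUFFER
#define KVM_CREATE_COALESCED_MMIO_BUFFER _IO(KVMIO, 0xd6)
#endif

/*
 * Create a per-zone ring fd, map it, and drain any coalesced writes.
 * vm_fd is an already-open KVM VM file descriptor.
 */
static int drain_coalesced_buffer(int vm_fd)
{
	long page_size = sysconf(_SC_PAGESIZE);
	struct kvm_coalesced_mmio_ring *ring;
	__u32 max, i;
	int buf_fd;

	buf_fd = ioctl(vm_fd, KVM_CREATE_COALESCED_MMIO_BUFFER);
	if (buf_fd < 0)
		return -1;

	/* In this version of the series the kernel allocates the ring on mmap(). */
	ring = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
		    buf_fd, 0);
	if (ring == MAP_FAILED)
		return -1;

	max = (page_size - sizeof(*ring)) / sizeof(struct kvm_coalesced_mmio);

	for (i = ring->first; i != ring->last; i = (i + 1) % max) {
		struct kvm_coalesced_mmio *e = &ring->coalesced_mmio[i];

		printf("coalesced %u-byte write at 0x%llx\n", e->len,
		       (unsigned long long)e->phys_addr);
	}
	ring->first = i;	/* mark the entries as consumed */

	return 0;
}
```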