Message ID | 20200424062540.23679-2-chirantan@chromium.org (mailing list archive)
---|---
State | New, archived
Series | [1/2] fuse: virtiofs: Fix nullptr dereference
On Fri, Apr 24, 2020 at 03:25:40PM +0900, Chirantan Ekbote wrote:
> Use simple round-robin scheduling based on the `unique` field of the
> fuse request to spread requests across multiple queues, if supported by
> the device.

Multiqueue is not intended to be used this way and this patch will
reduce performance*. I don't think it should be merged.

* I know it increases performance for you :) but hear me out:

The purpose of multiqueue is for SMP scalability. It allows queues to
be processed with CPU/NUMA affinity to the vCPU that submitted the
request (i.e. the virtqueue processing thread runs on a sibling physical
CPU core). Each queue has its own MSI-X interrupt so that completion
interrupts can be processed on the same vCPU that submitted the request.

Spreading requests across queues defeats all this. Virtqueue processing
threads that are located in the wrong place will now process the
requests. Completion interrupts will wake up a vCPU that did not submit
the request and IPIs are necessary to notify the vCPU that originally
submitted the request.

Even if you don't care about SMP performance, using multiqueue as a
workaround for missing request parallelism still won't yield the best
results. The guest should be able to submit up to the maximum queue
depth of the physical storage device. Many Linux block drivers have max
queue depths of 64. This would require 64 virtqueues (plus the queue
selection algorithm would have to utilize each one) and shows how
wasteful this approach is.

Instead of modifying the guest driver, please implement request
parallelism in your device implementation. Device implementations
should pop a virtqueue element, submit I/O, and then move on to the
next virtqueue element instead of waiting for the I/O to complete.
This can be done with a thread pool, coroutines, async I/O APIs, etc.
The C implementation of virtiofsd has request parallelism and there is
work underway to do it in Rust for cloud-hypervisor too. Perhaps you
can try the cloud-hypervisor code, I have CCed Sergio Lopez?

Thanks,
Stefan
On Mon, Apr 27, 2020 at 04:19:34PM +0100, Stefan Hajnoczi wrote:
> On Fri, Apr 24, 2020 at 03:25:40PM +0900, Chirantan Ekbote wrote:
> The C implementation of virtiofsd has request parallelism and there is
> work underway to do it in Rust for cloud-hypervisor too. Perhaps you
> can try the cloud-hypervisor code, I have CCed Sergio Lopez?

Sergio pointed me to this commit:
https://github.com/cloud-hypervisor/cloud-hypervisor/commit/710520e9a1860c587b4b3ec6aadc466ba1709557

Stefan
Hi Stefan,

On Tue, Apr 28, 2020 at 12:20 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Fri, Apr 24, 2020 at 03:25:40PM +0900, Chirantan Ekbote wrote:
> > Use simple round-robin scheduling based on the `unique` field of the
> > fuse request to spread requests across multiple queues, if supported by
> > the device.
>
> Multiqueue is not intended to be used this way and this patch will
> reduce performance*. I don't think it should be merged.
>
> * I know it increases performance for you :) but hear me out:
>

It actually doesn't increase performance for me :-(. It did increase
performance when I tested it on my 96-core workstation but on our actual
target devices, which only have 2 cores, using multiqueue or having
additional threads in the server actually made performance worse.

> The purpose of multiqueue is for SMP scalability. It allows queues to
> be processed with CPU/NUMA affinity to the vCPU that submitted the
> request (i.e. the virtqueue processing thread runs on a sibling physical
> CPU core). Each queue has its own MSI-X interrupt so that completion
> interrupts can be processed on the same vCPU that submitted the request.
>
> Spreading requests across queues defeats all this. Virtqueue processing
> threads that are located in the wrong place will now process the
> requests. Completion interrupts will wake up a vCPU that did not submit
> the request and IPIs are necessary to notify the vCPU that originally
> submitted the request.
>

Thanks for the explanation. I wasn't aware of this aspect of using
multiple queues but it makes sense now why we wouldn't want to spread
the requests across different queues.

> Even if you don't care about SMP performance, using multiqueue as a
> workaround for missing request parallelism still won't yield the best
> results. The guest should be able to submit up to the maximum queue
> depth of the physical storage device. Many Linux block drivers have max
> queue depths of 64. This would require 64 virtqueues (plus the queue
> selection algorithm would have to utilize each one) and shows how
> wasteful this approach is.
>

I understand this but in practice unlike the virtio-blk workload,
which is nothing but reads and writes to a single file, the virtio-fs
workload tends to mix a bunch of metadata operations with data
transfers. The metadata operations should be mostly handled out of
the host's file cache so it's unlikely virtio-fs would really be able
to fully utilize the underlying storage short of reading or writing a
really huge file.

> Instead of modifying the guest driver, please implement request
> parallelism in your device implementation.

Yes, we have tried this already [1][2]. As I mentioned above, having
additional threads in the server actually made performance worse. My
theory is that when the device only has 2 cpus, having additional
threads on the host that need cpu time ends up taking time away from
the guest vcpu. We're now looking at switching to io_uring so that we
can submit multiple requests from a single thread.

The multiqueue change was small and I wasn't aware of the SMP
implications, which is why I sent this patch.

Thanks,
Chirantan

[1] https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2103602
[2] https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2103603
On Fri, May 01, 2020 at 04:14:38PM +0900, Chirantan Ekbote wrote:
> On Tue, Apr 28, 2020 at 12:20 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Fri, Apr 24, 2020 at 03:25:40PM +0900, Chirantan Ekbote wrote:
> > Even if you don't care about SMP performance, using multiqueue as a
> > workaround for missing request parallelism still won't yield the best
> > results. The guest should be able to submit up to the maximum queue
> > depth of the physical storage device. Many Linux block drivers have max
> > queue depths of 64. This would require 64 virtqueues (plus the queue
> > selection algorithm would have to utilize each one) and shows how
> > wasteful this approach is.
>
> I understand this but in practice unlike the virtio-blk workload,
> which is nothing but reads and writes to a single file, the virtio-fs
> workload tends to mix a bunch of metadata operations with data
> transfers. The metadata operations should be mostly handled out of
> the host's file cache so it's unlikely virtio-fs would really be able
> to fully utilize the underlying storage short of reading or writing a
> really huge file.

I agree that a proportion of heavy I/O workloads on virtio-blk become
heavy metadata I/O workloads on virtio-fs. However, workloads
consisting mostly of READ, WRITE, and FLUSH operations still exist on
virtio-fs. Databases, audio/video file streaming, etc are bottlenecked
on I/O performance. They need to perform well and virtio-fs should
strive to do that.

> > Instead of modifying the guest driver, please implement request
> > parallelism in your device implementation.
>
> Yes, we have tried this already [1][2]. As I mentioned above, having
> additional threads in the server actually made performance worse. My
> theory is that when the device only has 2 cpus, having additional
> threads on the host that need cpu time ends up taking time away from
> the guest vcpu. We're now looking at switching to io_uring so that we
> can submit multiple requests from a single thread.

The host has 2 CPUs? How many vCPUs does the guest have? What is the
physical storage device? What is the host file system?

io_uring's vocabulary is expanding. It can now do openat2(2), close(2),
statx(2), but not mkdir(2), unlink(2), rename(2), etc.

I guess there are two options:
1. Fall back to threads for FUSE operations that cannot yet be done via
   io_uring.
2. Process FUSE operations that cannot be done via io_uring
   synchronously.

Stefan
On Sat, May 2, 2020 at 12:48 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Fri, May 01, 2020 at 04:14:38PM +0900, Chirantan Ekbote wrote:
> > On Tue, Apr 28, 2020 at 12:20 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > Instead of modifying the guest driver, please implement request
> > > parallelism in your device implementation.
> >
> > Yes, we have tried this already [1][2]. As I mentioned above, having
> > additional threads in the server actually made performance worse. My
> > theory is that when the device only has 2 cpus, having additional
> > threads on the host that need cpu time ends up taking time away from
> > the guest vcpu. We're now looking at switching to io_uring so that we
> > can submit multiple requests from a single thread.
>
> The host has 2 CPUs? How many vCPUs does the guest have? What is the
> physical storage device? What is the host file system?

The host has 2 cpus. The guest has 1 vcpu. The physical storage device
is an internal ssd. The file system is ext4 with directory encryption.

> io_uring's vocabulary is expanding. It can now do openat2(2), close(2),
> statx(2), but not mkdir(2), unlink(2), rename(2), etc.
>
> I guess there are two options:
> 1. Fall back to threads for FUSE operations that cannot yet be done via
>    io_uring.
> 2. Process FUSE operations that cannot be done via io_uring
>    synchronously.

I'm hoping that using io_uring for just the reads and writes should
give us a big enough improvement that we can do the rest of the
operations synchronously.

Chirantan
On Thu, May 07, 2020 at 05:10:15PM +0900, Chirantan Ekbote wrote:
> On Sat, May 2, 2020 at 12:48 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Fri, May 01, 2020 at 04:14:38PM +0900, Chirantan Ekbote wrote:
> > > On Tue, Apr 28, 2020 at 12:20 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > io_uring's vocabulary is expanding. It can now do openat2(2), close(2),
> > statx(2), but not mkdir(2), unlink(2), rename(2), etc.
> >
> > I guess there are two options:
> > 1. Fall back to threads for FUSE operations that cannot yet be done via
> >    io_uring.
> > 2. Process FUSE operations that cannot be done via io_uring
> >    synchronously.
>
> I'm hoping that using io_uring for just the reads and writes should
> give us a big enough improvement that we can do the rest of the
> operations synchronously.

Sounds like a good idea.

Stefan
```diff
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 97eec7522bf20..cad9f76c2519c 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -25,10 +25,6 @@
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
 
-/* Ordinary requests have even IDs, while interrupts IDs are odd */
-#define FUSE_INT_REQ_BIT (1ULL << 0)
-#define FUSE_REQ_ID_STEP (1ULL << 1)
-
 static struct kmem_cache *fuse_req_cachep;
 
 static struct fuse_dev *fuse_get_dev(struct file *file)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ca344bf714045..110b917d950a8 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,6 +47,10 @@
 /** Number of dentries for each connection in the control filesystem */
 #define FUSE_CTL_NUM_DENTRIES 5
 
+/* Ordinary requests have even IDs, while interrupts IDs are odd */
+#define FUSE_INT_REQ_BIT (1ULL << 0)
+#define FUSE_REQ_ID_STEP (1ULL << 1)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index d3c38222a7e4e..c5129fd27930c 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -980,7 +980,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 static void virtio_fs_wake_pending_and_unlock(struct fuse_iqueue *fiq)
 __releases(fiq->lock)
 {
-	unsigned int queue_id = VQ_REQUEST; /* TODO multiqueue */
+	unsigned int queue_id;
 	struct virtio_fs *fs;
 	struct fuse_req *req;
 	struct virtio_fs_vq *fsvq;
@@ -994,6 +994,8 @@ __releases(fiq->lock)
 	spin_unlock(&fiq->lock);
 
 	fs = fiq->priv;
+	queue_id = ((req->in.h.unique / FUSE_REQ_ID_STEP) %
+		    (uint64_t)fs->num_request_queues) + VQ_REQUEST;
 
 	pr_debug("%s: opcode %u unique %#llx nodeid %#llx in.len %u out.len %u\n",
 		 __func__, req->in.h.opcode, req->in.h.unique,
```
Use simple round-robin scheduling based on the `unique` field of the
fuse request to spread requests across multiple queues, if supported by
the device.

Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
---
 fs/fuse/dev.c       | 4 ----
 fs/fuse/fuse_i.h    | 4 ++++
 fs/fuse/virtio_fs.c | 4 +++-
 3 files changed, 7 insertions(+), 5 deletions(-)