
[2/2] fuse: virtiofs: Add basic multiqueue support

Message ID 20200424062540.23679-2-chirantan@chromium.org (mailing list archive)
State New, archived
Series [1/2] fuse: virtiofs: Fix nullptr dereference

Commit Message

Chirantan Ekbote April 24, 2020, 6:25 a.m. UTC
Use simple round-robin scheduling based on the `unique` field of the
fuse request to spread requests across multiple queues, if supported by
the device.
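
For illustration, with two request queues and FUSE_REQ_ID_STEP == 2 (see
the virtio_fs.c hunk below), successive requests map as follows:

	unique 2  ->  (2 / 2) % 2 = 1  ->  second request queue
	unique 4  ->  (4 / 2) % 2 = 0  ->  first request queue
	unique 6  ->  (6 / 2) % 2 = 1  ->  second request queue

so consecutive requests alternate between the two queues.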

Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
---
 fs/fuse/dev.c       | 4 ----
 fs/fuse/fuse_i.h    | 4 ++++
 fs/fuse/virtio_fs.c | 4 +++-
 3 files changed, 7 insertions(+), 5 deletions(-)

Comments

Stefan Hajnoczi April 27, 2020, 3:19 p.m. UTC | #1
On Fri, Apr 24, 2020 at 03:25:40PM +0900, Chirantan Ekbote wrote:
> Use simple round-robin scheduling based on the `unique` field of the
> fuse request to spread requests across multiple queues, if supported by
> the device.

Multiqueue is not intended to be used this way and this patch will
reduce performance*.  I don't think it should be merged.

* I know it increases performance for you :) but hear me out:

The purpose of multiqueue is for SMP scalability.  It allows queues to
be processed with CPU/NUMA affinity to the vCPU that submitted the
request (i.e. the virtqueue processing thread runs on a sibling physical
CPU core).  Each queue has its own MSI-X interrupt so that completion
interrupts can be processed on the same vCPU that submitted the request.

Spreading requests across queues defeats all this.  Virtqueue processing
threads that are located in the wrong place will now process the
requests.  Completion interrupts will wake up a vCPU that did not submit
the request and IPIs are necessary to notify the vCPU that originally
submitted the request.
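
For comparison, CPU-affine queue selection would look roughly like the
following (a hypothetical sketch, not something this patch implements;
it assumes at least one request queue per vCPU with interrupt affinity
set up to match):

	/* Hypothetical: keep the request on the queue tied to the
	 * submitting CPU so the completion interrupt fires there too. */
	queue_id = VQ_REQUEST +
		   (raw_smp_processor_id() % fs->num_request_queues);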

Even if you don't care about SMP performance, using multiqueue as a
workaround for missing request parallelism still won't yield the best
results.  The guest should be able to submit up to the maximum queue
depth of the physical storage device.  Many Linux block drivers have max
queue depths of 64.  This would require 64 virtqueues (plus the queue
selection algorithm would have to utilize each one) and shows how
wasteful this approach is.

Instead of modifying the guest driver, please implement request
parallelism in your device implementation.  Device implementations
should pop a virtqueue element, submit I/O, and then move on to the next
virtqueue element instead of waiting for the I/O to complete.  This can
be done with a thread pool, coroutines, async I/O APIs, etc.
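
In outline, such a device loop looks something like this (a generic
sketch only; pop_request(), submit_to_pool(), and handle_request() are
placeholder names, not virtiofsd or crosvm APIs):

	/* Drain the virtqueue and hand each element to a worker instead
	 * of completing it inline; the worker sends the reply and returns
	 * the element to the used ring when its I/O finishes. */
	struct vq_elem *elem;

	while ((elem = pop_request(vq)) != NULL)
		submit_to_pool(pool, handle_request, elem);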

The C implementation of virtiofsd has request parallelism, and there is
work underway to do the same in Rust for cloud-hypervisor too.  Perhaps
you can try the cloud-hypervisor code?  I have CCed Sergio Lopez.

Thanks,
Stefan
Stefan Hajnoczi April 27, 2020, 3:22 p.m. UTC | #2
On Mon, Apr 27, 2020 at 04:19:34PM +0100, Stefan Hajnoczi wrote:
> On Fri, Apr 24, 2020 at 03:25:40PM +0900, Chirantan Ekbote wrote:
> The C implementation of virtiofsd has request parallelism, and there is
> work underway to do the same in Rust for cloud-hypervisor too.  Perhaps
> you can try the cloud-hypervisor code?  I have CCed Sergio Lopez.

Sergio pointed me to this commit:

https://github.com/cloud-hypervisor/cloud-hypervisor/commit/710520e9a1860c587b4b3ec6aadc466ba1709557

Stefan
Chirantan Ekbote May 1, 2020, 7:14 a.m. UTC | #3
Hi Stefan,

On Tue, Apr 28, 2020 at 12:20 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Fri, Apr 24, 2020 at 03:25:40PM +0900, Chirantan Ekbote wrote:
> > Use simple round-robin scheduling based on the `unique` field of the
> > fuse request to spread requests across multiple queues, if supported by
> > the device.
>
> Multiqueue is not intended to be used this way and this patch will
> reduce performance*.  I don't think it should be merged.
>
> * I know it increases performance for you :) but hear me out:
>

It actually doesn't increase performance for me :-(.  It did increase
performance when I tested it on my 96-core workstation, but on our
actual target devices, which have only 2 cores, using multiqueue or
adding threads in the server made performance worse.

> The purpose of multiqueue is for SMP scalability.  It allows queues to
> be processed with CPU/NUMA affinity to the vCPU that submitted the
> request (i.e. the virtqueue processing thread runs on a sibling physical
> CPU core).  Each queue has its own MSI-X interrupt so that completion
> interrupts can be processed on the same vCPU that submitted the request.
>
> Spreading requests across queues defeats all this.  Virtqueue processing
> threads that are located in the wrong place will now process the
> requests.  Completion interrupts will wake up a vCPU that did not submit
> the request and IPIs are necessary to notify the vCPU that originally
> submitted the request.
>

Thanks for the explanation.  I wasn't aware of this aspect of using
multiple queues, but it now makes sense why we wouldn't want to spread
the requests across different queues.

> Even if you don't care about SMP performance, using multiqueue as a
> workaround for missing request parallelism still won't yield the best
> results.  The guest should be able to submit up to the maximum queue
> depth of the physical storage device.  Many Linux block drivers have max
> queue depths of 64.  This would require 64 virtqueues (plus the queue
> selection algorithm would have to utilize each one) and shows how
> wasteful this approach is.
>

I understand this, but in practice, unlike the virtio-blk workload,
which is nothing but reads and writes to a single file, the virtio-fs
workload tends to mix a bunch of metadata operations with data
transfers.  The metadata operations should mostly be served from the
host's file cache, so it's unlikely virtio-fs would really be able to
fully utilize the underlying storage short of reading or writing a
really huge file.

> Instead of modifying the guest driver, please implement request
> parallelism in your device implementation.

Yes, we have tried this already [1][2].  As I mentioned above, having
additional threads in the server actually made performance worse.  My
theory is that when the device only has 2 cpus, having additional
threads on the host that need cpu time ends up taking time away from
the guest vcpu.  We're now looking at switching to io_uring so that we
can submit multiple requests from a single thread.

The multiqueue change was small and I wasn't aware of the SMP
implications, which is why I sent this patch.

Thanks,
Chirantan

[1] https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2103602
[2] https://chromium-review.googlesource.com/c/chromiumos/platform/crosvm/+/2103603
Stefan Hajnoczi May 1, 2020, 3:47 p.m. UTC | #4
On Fri, May 01, 2020 at 04:14:38PM +0900, Chirantan Ekbote wrote:
> On Tue, Apr 28, 2020 at 12:20 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Fri, Apr 24, 2020 at 03:25:40PM +0900, Chirantan Ekbote wrote:
> > Even if you don't care about SMP performance, using multiqueue as a
> > workaround for missing request parallelism still won't yield the best
> > results.  The guest should be able to submit up to the maximum queue
> > depth of the physical storage device.  Many Linux block drivers have max
> > queue depths of 64.  This would require 64 virtqueues (plus the queue
> > selection algorithm would have to utilize each one) and shows how
> > wasteful this approach is.
> >
> 
> I understand this, but in practice, unlike the virtio-blk workload,
> which is nothing but reads and writes to a single file, the virtio-fs
> workload tends to mix a bunch of metadata operations with data
> transfers.  The metadata operations should mostly be served from the
> host's file cache, so it's unlikely virtio-fs would really be able to
> fully utilize the underlying storage short of reading or writing a
> really huge file.

I agree that a proportion of heavy I/O workloads on virtio-blk become
heavy metadata I/O workloads on virtio-fs.

However, workloads consisting mostly of READ, WRITE, and FLUSH
operations still exist on virtio-fs.  Databases, audio/video file
streaming, etc. are bottlenecked on I/O performance.  They need to
perform well, and virtio-fs should strive to make that possible.

> > Instead of modifying the guest driver, please implement request
> > parallelism in your device implementation.
> 
> Yes, we have tried this already [1][2].  As I mentioned above, having
> additional threads in the server actually made performance worse.  My
> theory is that when the device only has 2 cpus, having additional
> threads on the host that need cpu time ends up taking time away from
> the guest vcpu.  We're now looking at switching to io_uring so that we
> can submit multiple requests from a single thread.

The host has 2 CPUs?  How many vCPUs does the guest have?  What is the
physical storage device?  What is the host file system?

io_uring's vocabulary is expanding.  It can now do openat2(2), close(2),
statx(2), but not mkdir(2), unlink(2), rename(2), etc.

I guess there are two options:
1. Fall back to threads for FUSE operations that cannot yet be done via
   io_uring.
2. Process FUSE operations that cannot be done via io_uring
   synchronously.
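
For example, option 2 could look roughly like this with liburing (a
sketch only; do_sync_op() stands in for the existing synchronous path,
and fd/buf/size/off come from the decoded FUSE request):

	struct io_uring_sqe *sqe;

	switch (in->opcode) {
	case FUSE_READ:
	case FUSE_WRITE:
		sqe = io_uring_get_sqe(&ring);
		if (in->opcode == FUSE_READ)
			io_uring_prep_read(sqe, fd, buf, size, off);
		else
			io_uring_prep_write(sqe, fd, buf, size, off);
		io_uring_sqe_set_data(sqe, req);
		io_uring_submit(&ring);	/* completions reaped elsewhere */
		break;
	default:
		do_sync_op(req);	/* everything else stays synchronous */
		break;
	}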

Stefan
Chirantan Ekbote May 7, 2020, 8:10 a.m. UTC | #5
On Sat, May 2, 2020 at 12:48 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Fri, May 01, 2020 at 04:14:38PM +0900, Chirantan Ekbote wrote:
> > On Tue, Apr 28, 2020 at 12:20 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > Instead of modifying the guest driver, please implement request
> > > parallelism in your device implementation.
> >
> > Yes, we have tried this already [1][2].  As I mentioned above, having
> > additional threads in the server actually made performance worse.  My
> > theory is that when the device only has 2 cpus, having additional
> > threads on the host that need cpu time ends up taking time away from
> > the guest vcpu.  We're now looking at switching to io_uring so that we
> > can submit multiple requests from a single thread.
>
> The host has 2 CPUs?  How many vCPUs does the guest have?  What is the
> physical storage device?  What is the host file system?

The host has 2 CPUs.  The guest has 1 vCPU.  The physical storage
device is an internal SSD.  The file system is ext4 with directory
encryption.


>
> io_uring's vocabulary is expanding.  It can now do openat2(2), close(2),
> statx(2), but not mkdir(2), unlink(2), rename(2), etc.
>
> I guess there are two options:
> 1. Fall back to threads for FUSE operations that cannot yet be done via
>    io_uring.
> 2. Process FUSE operations that cannot be done via io_uring
>    synchronously.
>

I'm hoping that using io_uring for just the reads and writes should
give us a big enough improvement that we can do the rest of the
operations synchronously.

Chirantan
Stefan Hajnoczi June 2, 2020, 9:29 a.m. UTC | #6
On Thu, May 07, 2020 at 05:10:15PM +0900, Chirantan Ekbote wrote:
> On Sat, May 2, 2020 at 12:48 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Fri, May 01, 2020 at 04:14:38PM +0900, Chirantan Ekbote wrote:
> > > On Tue, Apr 28, 2020 at 12:20 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > io_uring's vocabulary is expanding.  It can now do openat2(2), close(2),
> > statx(2), but not mkdir(2), unlink(2), rename(2), etc.
> >
> > I guess there are two options:
> > 1. Fall back to threads for FUSE operations that cannot yet be done via
> >    io_uring.
> > 2. Process FUSE operations that cannot be done via io_uring
> >    synchronously.
> >
> 
> I'm hoping that using io_uring for just the reads and writes should
> give us a big enough improvement that we can do the rest of the
> operations synchronously.

Sounds like a good idea.

Stefan

Patch

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 97eec7522bf20..cad9f76c2519c 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -25,10 +25,6 @@ 
 MODULE_ALIAS_MISCDEV(FUSE_MINOR);
 MODULE_ALIAS("devname:fuse");
 
-/* Ordinary requests have even IDs, while interrupts IDs are odd */
-#define FUSE_INT_REQ_BIT (1ULL << 0)
-#define FUSE_REQ_ID_STEP (1ULL << 1)
-
 static struct kmem_cache *fuse_req_cachep;
 
 static struct fuse_dev *fuse_get_dev(struct file *file)
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index ca344bf714045..110b917d950a8 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -47,6 +47,10 @@ 
 /** Number of dentries for each connection in the control filesystem */
 #define FUSE_CTL_NUM_DENTRIES 5
 
+/* Ordinary requests have even IDs, while interrupts IDs are odd */
+#define FUSE_INT_REQ_BIT (1ULL << 0)
+#define FUSE_REQ_ID_STEP (1ULL << 1)
+
 /** List of active connections */
 extern struct list_head fuse_conn_list;
 
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index d3c38222a7e4e..c5129fd27930c 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -980,7 +980,7 @@  static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
 static void virtio_fs_wake_pending_and_unlock(struct fuse_iqueue *fiq)
 __releases(fiq->lock)
 {
-	unsigned int queue_id = VQ_REQUEST; /* TODO multiqueue */
+	unsigned int queue_id;
 	struct virtio_fs *fs;
 	struct fuse_req *req;
 	struct virtio_fs_vq *fsvq;
@@ -994,6 +994,8 @@  __releases(fiq->lock)
 	spin_unlock(&fiq->lock);
 
 	fs = fiq->priv;
+	queue_id = ((req->in.h.unique / FUSE_REQ_ID_STEP) %
+		    (uint64_t)fs->num_request_queues) + VQ_REQUEST;
 
 	pr_debug("%s: opcode %u unique %#llx nodeid %#llx in.len %u out.len %u\n",
 		  __func__, req->in.h.opcode, req->in.h.unique,