diff mbox series

[RFC,v2,19/19] fuse: {uring} Optimize async sends

Message ID 20240529-fuse-uring-for-6-9-rfc2-out-v1-19-d149476b1d65@ddn.com (mailing list archive)
State New
Headers show
Series fuse: fuse-over-io-uring | expand

Commit Message

Bernd Schubert May 29, 2024, 6 p.m. UTC
This is to avoid using async completion tasks
(i.e. context switches) when not needed.

Cc: io-uring@vger.kernel.org
Signed-off-by: Bernd Schubert <bschubert@ddn.com>

---
This condition should be better verified by io-uring developers.

} else if (current->io_uring) {
    /* There are two cases here
     * 1) fuse-server side uses multiple threads accessing
     *    the ring
     * 2) IO requests through io-uring
     */
    send_in_task = true;
    issue_flags = 0;
---
 fs/fuse/dev_uring.c | 57 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 46 insertions(+), 11 deletions(-)

Comments

Jens Axboe May 31, 2024, 4:24 p.m. UTC | #1
On 5/29/24 12:00 PM, Bernd Schubert wrote:
> This is to avoid using async completion tasks
> (i.e. context switches) when not needed.
> 
> Cc: io-uring@vger.kernel.org
> Signed-off-by: Bernd Schubert <bschubert@ddn.com>

This patch is very confusing, even after having pulled the other
changes. In general, would be great if the io_uring list was CC'ed on
the whole series, it's very hard to review just a single patch, when you
don't have the full picture.

Outside of that, would be super useful to include a blurb on how you set
things up for testing, and how you run the testing. That would really
help in terms of being able to run and test it, and also to propose
changes that might make a big difference.
Bernd Schubert May 31, 2024, 5:36 p.m. UTC | #2
On 5/31/24 18:24, Jens Axboe wrote:
> On 5/29/24 12:00 PM, Bernd Schubert wrote:
>> This is to avoid using async completion tasks
>> (i.e. context switches) when not needed.
>>
>> Cc: io-uring@vger.kernel.org
>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
> 
> This patch is very confusing, even after having pulled the other
> changes. In general, would be great if the io_uring list was CC'ed on

Hmm, let me try to explain. And yes, I definitely need to add these details
to the commit message.

Without the patch:

<sending a struct fuse_req> 

fuse_uring_queue_fuse_req
    fuse_uring_send_to_ring
        io_uring_cmd_complete_in_task
        
<async task runs>
    io_uring_cmd_done()


Now I would like to call io_uring_cmd_done() directly, without another task,
whenever possible. I didn't benchmark it, but an extra task is in general
against the entire concept of avoiding context switches. That is where the
patch comes in.


fuse_uring_queue_fuse_req() now adds the information if io_uring_cmd_done() 
shall be called directly or via io_uring_cmd_complete_in_task().


Doing it directly requires the knowledge of issue_flags - these are the
conditions in fuse_uring_queue_fuse_req.


1) (current == queue->server_task)
fuse_uring_cmd (IORING_OP_URING_CMD) received a completion for a 
previous fuse_req, after completion it fetched the next fuse_req and 
wants to send it - for 'current == queue->server_task' the issue_flags
were stored in struct fuse_ring_queue::uring_cmd_issue_flags.

2) 'else if (current->io_uring)'

(actually documented in the code)

2.1 This might be through IORING_OP_URING_CMD as well, but then the server 
side uses multiple threads to access the same ring - not nice. We only
store issue_flags into the queue for 'current == queue->server_task', so
we do not know the issue_flags - sending through a task is needed.

2.2 This might be an application request through the mount point, using
the io-uring interface. We do not know the issue_flags here either.
(That one was actually a surprise for me - xfstests caught it.
Initially I had a condition to send without the extra task, then lockdep
caught that.)


In both cases it has to use a task.


My question here is if 'current->io_uring' is reliable.


3) everything else

3.1) For async requests, the interesting cases are cached reads and writes.
At a minimum, writes hold a spin lock, and that lock conflicts with the mutex
io-uring takes - we need a task here as well.

3.2) sync - no lock is held, it can send without the extra task.


> the whole series, it's very hard to review just a single patch, when you
> don't have the full picture.

Sorry, I will do that for the next version.

> 
> Outside of that, would be super useful to include a blurb on how you set
> things up for testing, and how you run the testing. That would really
> help in terms of being able to run and test it, and also to propose
> changes that might make a big difference.
> 

Will do in the next version. 
You basically need my libfuse uring branch
(right now the commit history is not cleaned up) and then follow the
instructions in <libfuse>/xfstests/README.md on how to run xfstests.
Missing is a small patch for that dir to set extra daemon parameters,
like direct-io (fuse's FOPEN_DIRECT_IO) and io-uring. Will add that to
libfuse during the next days.


Thanks,
Bernd
Jens Axboe May 31, 2024, 7:10 p.m. UTC | #3
On 5/31/24 11:36 AM, Bernd Schubert wrote:
> On 5/31/24 18:24, Jens Axboe wrote:
>> On 5/29/24 12:00 PM, Bernd Schubert wrote:
>>> This is to avoid using async completion tasks
>>> (i.e. context switches) when not needed.
>>>
>>> Cc: io-uring@vger.kernel.org
>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>
>> This patch is very confusing, even after having pulled the other
>> changes. In general, would be great if the io_uring list was CC'ed on
> 
> Hmm, let me try to explain. And yes, I definitely need to add these details 
> to the commit message
> 
> Without the patch:
> 
> <sending a struct fuse_req> 
> 
> fuse_uring_queue_fuse_req
>     fuse_uring_send_to_ring
>         io_uring_cmd_complete_in_task
>         
> <async task runs>
>     io_uring_cmd_done()

And this is a worthwhile optimization, you always want to complete it
inline if at all possible. But none of this logic or code belongs in fuse,
it really should be provided by io_uring helpers.

I would just drop this patch for now and focus on the core
functionality. Send out a version with that, and then we'll be happy to
help make this as performant as it can be. This is where the ask on "how to
reproduce your numbers" comes from - with that, it's usually trivial to
spot areas where things could be improved. And I strongly suspect that
will involve providing you with the right API to use here, and perhaps
refactoring a bit on the fuse side. Making up issue_flags is _really_
not something a user should do.

> 1) (current == queue->server_task)
> fuse_uring_cmd (IORING_OP_URING_CMD) received a completion for a 
> previous fuse_req, after completion it fetched the next fuse_req and 
> wants to send it - for 'current == queue->server_task' issue flags
> got stored in struct fuse_ring_queue::uring_cmd_issue_flags

And queue->server_task is the owner of the ring? Then yes, that is safe.
> 
> 2) 'else if (current->io_uring)'
> 
> (actually documented in the code)
> 
> 2.1 This might be through IORING_OP_URING_CMD as well, but then server 
> side uses multiple threads to access the same ring - not nice. We only
> store issue_flags into the queue for 'current == queue->server_task', so
> we do not know issue_flags - sending through task is needed.

What's the path leading to you not having the issue_flags?

> 2.2 This might be an application request through the mount point, through
> the io-uring interface. We do know issue flags either.
> (That one was actually a surprise for me, when xfstests caught it.
> Initially I had a condition to send without the extra task then lockdep
> caught that.

In general, if you don't know the context (eg you don't have issue_flags
passed in), you should probably assume the only sane way to proceed is to
have it processed by the task itself.

> 
> In both cases it has to use a tasks.
> 
> 
> My question here is if 'current->io_uring' is reliable.

Yes, that will be reliable in the sense that it tells you that the
current task has (at least) one io_uring context set up. But it doesn't
tell you anything beyond that, like whether it's the owner of this request.

> 3) everything else
> 
> 3.1) For async requests, interesting are cached reads and writes here. At a minimum
> writes a holding a spin lock and that lock conflicts with the mutex io-uring is taking - 
> we need a task as well
> 
> 3.2) sync - no lock being hold, it can send without the extra task.

As mentioned, let's drop this patch 19 for now. Send out what you have
with instructions on how to test it, and I'll give it a spin and see
what we can do about this.

>> Outside of that, would be super useful to include a blurb on how you set
>> things up for testing, and how you run the testing. That would really
>> help in terms of being able to run and test it, and also to propose
>> changes that might make a big difference.
>>
> 
> Will do in the next version. 
> You basically need my libfuse uring branch
> (right now commit history is not cleaned up) and follow
> instructions in <libfuse>/xfstests/README.md how to run xfstests.
> Missing is a slight patch for that dir to set extra daemon parameters,
> like direct-io (fuse' FOPEN_DIRECT_IO) and io-uring. Will add that libfuse
> during the next days.

I'll leave the xfstests to you for now, but running some perf testing
just to verify how it's being used would be useful and help improve it
for sure.
Bernd Schubert June 1, 2024, 4:37 p.m. UTC | #4
On 5/31/24 21:10, Jens Axboe wrote:
> On 5/31/24 11:36 AM, Bernd Schubert wrote:
>> On 5/31/24 18:24, Jens Axboe wrote:
>>> On 5/29/24 12:00 PM, Bernd Schubert wrote:
>>>> This is to avoid using async completion tasks
>>>> (i.e. context switches) when not needed.
>>>>
>>>> Cc: io-uring@vger.kernel.org
>>>> Signed-off-by: Bernd Schubert <bschubert@ddn.com>
>>>
>>> This patch is very confusing, even after having pulled the other
>>> changes. In general, would be great if the io_uring list was CC'ed on
>>
>> Hmm, let me try to explain. And yes, I definitely need to add these details 
>> to the commit message
>>
>> Without the patch:
>>
>> <sending a struct fuse_req> 
>>
>> fuse_uring_queue_fuse_req
>>     fuse_uring_send_to_ring
>>         io_uring_cmd_complete_in_task
>>         
>> <async task runs>
>>     io_uring_cmd_done()
> 
> And this is a worthwhile optimization, you always want to complete it
> line if at all possible. But none of this logic or code belongs in fuse,
> it really should be provided by io_uring helpers.
> 
> I would just drop this patch for now and focus on the core
> functionality. Send out a version with that, and then we'll be happy to
> help this as performant as it can be. This is where the ask on "how to
> reproduce your numbers" comes from - with that, it's usually trivial to
> spot areas where things could be improved. And I strongly suspect that
> will involve providing you with the right API to use here, and perhaps
> refactoring a bit on the fuse side. Making up issue_flags is _really_
> not something a user should do.

Great that you agree - I don't like the issue_flags handling in fuse code either.
I will also follow your suggestion and drop this patch.


> 
>> 1) (current == queue->server_task)
>> fuse_uring_cmd (IORING_OP_URING_CMD) received a completion for a 
>> previous fuse_req, after completion it fetched the next fuse_req and 
>> wants to send it - for 'current == queue->server_task' issue flags
>> got stored in struct fuse_ring_queue::uring_cmd_issue_flags
> 
> And queue->server_task is the owner of the ring? Then yes that is safe

Yeah, it is the thread that submits SQEs - it should be the owner of the ring,
unless the daemon side does something wrong (given that there are several
userspace implementations and not only libfuse, we need to expect
and handle implementation errors, though).

>>
>> 2) 'else if (current->io_uring)'
>>
>> (actually documented in the code)
>>
>> 2.1 This might be through IORING_OP_URING_CMD as well, but then server 
>> side uses multiple threads to access the same ring - not nice. We only
>> store issue_flags into the queue for 'current == queue->server_task', so
>> we do not know issue_flags - sending through task is needed.
> 
> What's the path leading to you not having the issue_flags?

We do get issue_flags here, but I want to keep changes to libfuse small and
want to avoid changing non-uring-related function signatures. That is why
we store issue_flags for the presumed ring-owner thread in the queue data
structure, but we don't have them for possible other threads.

Example:

IORING_OP_URING_CMD
   fuse_uring_cmd
       fuse_uring_commit_and_release
           fuse_uring_req_end_and_get_next --> until here issue_flags passed
               fuse_request_end -> generic fuse function,  issue_flags not passed
                   req->args->end() / fuse_writepage_end
                       fuse_simple_background
                           fuse_request_queue_background
                               fuse_request_queue_background_uring
                                   fuse_uring_queue_fuse_req
                                       fuse_uring_send_to_ring
                                           io_uring_cmd_done
                   
      
I.e. we had issue_flags up to fuse_uring_req_end_and_get_next(), but then
call into generic fuse functions and stop passing issue_flags through.
For the ring owner we take the issue_flags stored by fuse_uring_cmd()
in struct fuse_ring_queue, but if the daemon side uses multiple threads to
access the ring we won't have that. Well, we could allow it and store
them in an array or rb-tree, but I don't like multiple threads accessing
something that is already optimized to have a thread per core.

> 
>> 2.2 This might be an application request through the mount point, through
>> the io-uring interface. We do know issue flags either.
>> (That one was actually a surprise for me, when xfstests caught it.
>> Initially I had a condition to send without the extra task then lockdep
>> caught that.
> 
> In general, if you don't know the context (eg you don't have issue_flags
> passed in), you should probably assume the only way is to sanely proceed
> is to have it processed by the task itself.
> 
>>
>> In both cases it has to use a tasks.
>>
>>
>> My question here is if 'current->io_uring' is reliable.
> 
> Yes that will be reliable in the sense that it tells you that the
> current task has (at least) one io_uring context setup. But it doesn't
> tell you anything beyond that, like if it's the owner of this request.

Yeah, you can see that it just checks for current->io_uring and then
uses a task.

> 
>> 3) everything else
>>
>> 3.1) For async requests, interesting are cached reads and writes here. At a minimum
>> writes a holding a spin lock and that lock conflicts with the mutex io-uring is taking - 
>> we need a task as well
>>
>> 3.2) sync - no lock being hold, it can send without the extra task.
> 
> As mentioned, let's drop this patch 19 for now. Send out what you have
> with instructions on how to test it, and I'll give it a spin and see
> what we can do about this.
> 
>>> Outside of that, would be super useful to include a blurb on how you set
>>> things up for testing, and how you run the testing. That would really
>>> help in terms of being able to run and test it, and also to propose
>>> changes that might make a big difference.
>>>
>>
>> Will do in the next version. 
>> You basically need my libfuse uring branch
>> (right now commit history is not cleaned up) and follow
>> instructions in <libfuse>/xfstests/README.md how to run xfstests.
>> Missing is a slight patch for that dir to set extra daemon parameters,
>> like direct-io (fuse' FOPEN_DIRECT_IO) and io-uring. Will add that libfuse
>> during the next days.
> 
> I'll leave the xfstests to you for now, but running some perf testing
> just to verify how it's being used would be useful and help improve it
> for sure.
> 

Ah, you meant performance tests. I used libfuse/example/passthrough_hp from
my uring branch, and then fio on top of that for reads/writes and mdtest from
the ior repo for metadata. Maybe I should upload my scripts somewhere.


Thanks,
Bernd
diff mbox series

Patch

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index cdc5836edb6e..74407e5e86fa 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -32,7 +32,8 @@ 
 #include <linux/io_uring/cmd.h>
 
 static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
-					    bool set_err, int error);
+					    bool set_err, int error,
+					    unsigned int issue_flags);
 
 static void fuse_ring_ring_ent_unset_userspace(struct fuse_ring_ent *ent)
 {
@@ -682,7 +683,9 @@  static int fuse_uring_copy_to_ring(struct fuse_ring *ring, struct fuse_req *req,
  * userspace will read it
  * This is comparable with classical read(/dev/fuse)
  */
-static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent)
+static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent,
+				    unsigned int issue_flags,
+				    bool send_in_task)
 {
 	struct fuse_ring *ring = ring_ent->queue->ring;
 	struct fuse_ring_req *rreq = ring_ent->rreq;
@@ -723,13 +726,16 @@  static void fuse_uring_send_to_ring(struct fuse_ring_ent *ring_ent)
 		 __func__, ring_ent->queue->qid, ring_ent->tag, ring_ent->state,
 		 rreq->in.opcode, rreq->in.unique);
 
-	io_uring_cmd_complete_in_task(ring_ent->cmd,
-				      fuse_uring_async_send_to_ring);
+	if (send_in_task)
+		io_uring_cmd_complete_in_task(ring_ent->cmd,
+					      fuse_uring_async_send_to_ring);
+	else
+		io_uring_cmd_done(ring_ent->cmd, 0, 0, issue_flags);
 
 	return;
 
 err:
-	fuse_uring_req_end_and_get_next(ring_ent, true, err);
+	fuse_uring_req_end_and_get_next(ring_ent, true, err, issue_flags);
 }
 
 /*
@@ -806,7 +812,8 @@  static bool fuse_uring_ent_release_and_fetch(struct fuse_ring_ent *ring_ent)
  * has lock/unlock/lock to avoid holding the lock on calling fuse_request_end
  */
 static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
-					    bool set_err, int error)
+					    bool set_err, int error,
+					    unsigned int issue_flags)
 {
 	struct fuse_req *req = ring_ent->fuse_req;
 	int has_next;
@@ -822,7 +829,7 @@  static void fuse_uring_req_end_and_get_next(struct fuse_ring_ent *ring_ent,
 	has_next = fuse_uring_ent_release_and_fetch(ring_ent);
 	if (has_next) {
 		/* called within uring context - use provided flags */
-		fuse_uring_send_to_ring(ring_ent);
+		fuse_uring_send_to_ring(ring_ent, issue_flags, false);
 	}
 }
 
@@ -857,7 +864,7 @@  static void fuse_uring_commit_and_release(struct fuse_dev *fud,
 out:
 	pr_devel("%s:%d ret=%zd op=%d req-ret=%d\n", __func__, __LINE__, err,
 		 req->args->opcode, req->out.h.error);
-	fuse_uring_req_end_and_get_next(ring_ent, set_err, err);
+	fuse_uring_req_end_and_get_next(ring_ent, set_err, err, issue_flags);
 }
 
 /*
@@ -1156,10 +1163,12 @@  int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
 	struct fuse_ring_queue *queue;
 	struct fuse_ring_ent *ring_ent = NULL;
 	int res;
-	int async = test_bit(FR_BACKGROUND, &req->flags) &&
-		    !req->args->async_blocking;
+	int async_req = test_bit(FR_BACKGROUND, &req->flags);
+	int async = async_req && !req->args->async_blocking;
 	struct list_head *ent_queue, *req_queue;
 	int qid;
+	bool send_in_task = false;
+	unsigned int issue_flags;
 
 	qid = fuse_uring_get_req_qid(req, ring, async);
 	queue = fuse_uring_get_queue(ring, qid);
@@ -1182,11 +1191,37 @@  int fuse_uring_queue_fuse_req(struct fuse_conn *fc, struct fuse_req *req)
 			list_first_entry(ent_queue, struct fuse_ring_ent, list);
 		list_del(&ring_ent->list);
 		fuse_uring_add_req_to_ring_ent(ring_ent, req);
+		if (current == queue->server_task) {
+			issue_flags = queue->uring_cmd_issue_flags;
+		} else if (current->io_uring) {
+			/* There are two cases here
+			 * 1) fuse-server side uses multiple threads accessing
+			 *    the ring. We only have issue_flags stored
+			 *    in the queue for one thread (the first one
+			 *    that submits FUSE_URING_REQ_FETCH)
+			 * 2) IO requests through io-uring, we do not have
+			 *    issue flags at all for these
+			 */
+			send_in_task = true;
+			issue_flags = 0;
+		} else {
+			if (async_req) {
+				/*
+				 * page cache writes might hold an upper
+				 * spinlock, which conflicts with the io-uring
+				 * mutex
+				 */
+				send_in_task = true;
+				issue_flags = 0;
+			} else {
+				issue_flags = IO_URING_F_UNLOCKED;
+			}
+		}
 	}
 	spin_unlock(&queue->lock);
 
 	if (ring_ent != NULL)
-		fuse_uring_send_to_ring(ring_ent);
+		fuse_uring_send_to_ring(ring_ent, issue_flags, send_in_task);
 
 	return 0;