Message ID | 20240529-fuse-uring-for-6-9-rfc2-out-v1-0-d149476b1d65@ddn.com (mailing list archive) |
---|---|
Series | fuse: fuse-over-io-uring |
On Wed, May 29, 2024 at 9:01 PM Bernd Schubert <bschubert@ddn.com> wrote: > > From: Bernd Schubert <bschubert@ddn.com> > > This adds support for uring communication between kernel and > userspace daemon using the IORING_OP_URING_CMD opcode. The basic > approach was taken from ublk. The patches are in RFC state, > some major changes are still to be expected. > > Motivation for these patches is to increase fuse performance. > In fuse-over-io-uring requests avoid core switching (application > on core X, processing of fuse server on random core Y) and use > shared memory between kernel and userspace to transfer data. > Similar approaches have been taken by ZUFS and FUSE2, though > not over io-uring, but through ioctl IOs > > https://lwn.net/Articles/756625/ > https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2 > > Avoiding cache line bouncing / NUMA issues was discussed > between Amir and Miklos before and Miklos had posted > part of the private discussion here > https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/ > > This cache line bouncing should be addressed by these patches > as well. > > I had also noticed waitq wake-up latencies in fuse before > https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/ > > This spinning approach helped with performance (>40% improvement > for file creates), but due to random server side thread/core utilization > spinning cannot be well controlled in /dev/fuse mode. > With fuse-over-io-uring requests are handled on the same core > (sync requests) or on core+1 (large async requests) and performance > improvements are achieved without spinning. > > Splice/zero-copy is not supported yet, Ming Lei is working > on io-uring support for ublk_drv, but I think so far there > is no final agreement on the approach to be taken. > Fuse-over-io-uring runs significantly faster than reads/writes > over /dev/fuse, even with splice enabled, so missing zc > should not be a blocking issue. > > The patches have been tested with multiple xfstest runs in a VM > (32 cores) with a kernel that has several debug options > enabled (like KASAN and MSAN). > For some tests xfstests reports that O_DIRECT is not supported, > I need to investigate that. The interesting part is that exactly > these tests fail in plain /dev/fuse posix mode. I had to disable > generic/650, which is enabling/disabling cpu cores - given ring > threads are bound to cores, issues with that are not totally > unexpected, but then there are (scheduler) kernel messages that > core binding for these threads is removed - this needs > to be further investigated. > A nice effect in io-uring mode is that tests run faster (like > generic/522 ~2400s /dev/fuse vs. ~1600s patched), though still > slow as this is with ASAN/leak-detection/etc. > > The corresponding libfuse patches are on my uring branch, > but need cleanup for submission - that will happen during the next > days. > https://github.com/bsbernd/libfuse/tree/uring > > If it makes review easier, the patches posted here are on > this branch > https://github.com/bsbernd/linux/tree/fuse-uring-for-6.9-rfc2 > > TODO list for next RFC versions > - Let the ring configure ioctl return information, like mmap/queue-buf size > - Request kernel side address and len for a request - avoid calculation in userspace? > - multiple IO sizes per queue (avoiding a calculation in userspace is probably even > more important) > - FUSE_INTERRUPT handling? 
> - Logging (adds fields in the ioctl and also ring-request), > any mismatch between client and server is currently very hard to understand > through error codes > > Future work > - notifications, probably on their own ring > - zero copy > > I had run quite some benchmarks with linux-6.2 before LSFMMBPF2023, > which resulted in some tuning patches (at the end of the > patch series). >
> Some benchmark results
> ======================
>
> System used for the benchmark is a 32 core (HyperThreading enabled)
> Xeon E5-2650 system. I don't have local disks attached that could do
> >5GB/s IOs, for paged and dio results a patched version of passthrough-hp
> was used that bypasses final reads/writes.
>
> paged reads
> -----------
>                128K IO size               1024K IO size
> jobs   /dev/fuse   uring   gain    /dev/fuse   uring   gain
>  1        1117      1921   1.72       1902      1942   1.02
>  2        2502      3527   1.41       3066      3260   1.06
>  4        5052      6125   1.21       5994      6097   1.02
>  8        6273     10855   1.73       7101     10491   1.48
> 16        6373     11320   1.78       7660     11419   1.49
> 24        6111      9015   1.48       7600      9029   1.19
> 32        5725      7968   1.39       6986      7961   1.14
>
> dio reads (1024K)
> -----------------
>
> jobs   /dev/fuse   uring   gain
>  1        2023      3998   2.42
>  2        3375      7950   2.83
>  4        3823     15022   3.58
>  8        7796     22591   2.77
> 16        8520     27864   3.27
> 24        8361     20617   2.55
> 32        8717     12971   1.55
>
> mmap reads (4K)
> ---------------
> (sequential, I probably should have made it random, sequential exposes
> a rather interesting/weird 'optimized' memcpy issue - sequential becomes
> reversed order 4K read)
> https://lore.kernel.org/linux-fsdevel/aae918da-833f-7ec5-ac8a-115d66d80d0e@fastmail.fm/
>
> jobs   /dev/fuse   uring   gain
>  1         130       323   2.49
>  2         219       538   2.46
>  4         503      1040   2.07
>  8        1472      2039   1.38
> 16        2191      3518   1.61
> 24        2453      4561   1.86
> 32        2178      5628   2.58
>
> (Results on request, setting MAP_HUGETLB much improves performance
> for both, io-uring mode then has a slight advantage only.)
>
> creates/s
> ----------
> threads   /dev/fuse    uring   gain
>  1            3944    10121   2.57
>  2            8580    24524   2.86
>  4           16628    44426   2.67
>  8           46746    56716   1.21
> 16           79740   102966   1.29
> 20           80284   119502   1.49
>
> (the gain drop with >=8 cores needs to be investigated)

Hi Bernd, Those are impressive results! When approaching the FUSE uring feature from a marketing POV, I think that putting the emphasis on metadata operations is the best approach. Not that dio reads are not important (I know that is part of your use case), but I imagine there are a lot more people out there waiting for improvement in metadata operations overhead. To me it helps to know what the current main pain points are for people using FUSE filesystems wrt performance. Although it may not be up to date, the most comprehensive study about FUSE performance overhead is this FAST17 paper: https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf In this paper, table 3 summarizes the different overheads observed per workload. According to this table, the workloads that degrade performance the most on an optimized passthrough fs over SSD are: - many file creates - many file deletes - many small file reads In all these workloads, it was millions of files over many directories. The highest performance regression reported was -83% on many small file creations. The moral of this long story is that it would be nice to know what performance improvement FUSE uring can aspire to. This is especially relevant for people that would be interested in combining the benefits of FUSE passthrough (for data) and FUSE uring (for metadata). What did passthrough_hp do in your patched version with creates? 
Did it actually create the files? In how many directories? Maybe the directory inode lock impeded performance improvement with >=8 threads? > > Remaining TODO list for RFCv3: > -------------------------------- > 1) Let the ring configure ioctl return information, > like mmap/queue-buf size > > Right now libfuse and kernel have lots of duplicated setup code > and any kind of pointer/offset mismatch results in a non-working > ring that is hard to debug - probably better when the kernel does > the calculations and returns that to server side > > 2) In combination with 1, ring requests should retrieve their > userspace address and length from kernel side instead of > calculating it through the mmaped queue buffer on their own. > (Introduction of FUSE_URING_BUF_ADDR_FETCH) > > 3) Add log buffer into the ioctl and ring-request > > This is to provide better error messages (instead of just > errno) > > 3) Multiple IO sizes per queue > > Small IOs and metadata requests do not need large buffer sizes, > we need multiple IO sizes per queue. > > 4) FUSE_INTERRUPT handling > > These are not handled yet, kernel side is probably not difficult > anymore as ring entries take fuse requests through lists. > > Long term TODO: > -------------- > Notifications through io-uring, maybe with a separated ring, > but I'm not sure yet. Is that going to improve performance in any real life workload? Thanks, Amir. > > Changes since RFCv1 > ------------------- > - No need to hold the task of the server side anymore. Also no > ioctls/threads waiting for shutdown anymore. Shutdown now more > works like the traditional fuse way. > - Each queue clones the fuse and device release makes an exception > for io-uring. Reason is that queued IORING_OP_URING_CMD > (through .uring_cmd) prevent a device release. I.e. a killed > server side typically triggers fuse_abort_conn(). This was the > reason for the async stop-monitor in v1 and reference on the daemon > task. However it was very racy and annotated immediately by Miklos. > - In v1 the offset parameter to mmap was identifying the QID, in v2 > server side is expected to send mmap from a core bound ring thread > in numa mode and numa node is taken through the core of that thread. > Kernel side of the mmap buffer is stored in an rbtree and assigned > to the right qid through an additional queue ioctl. > - Release of IORING_OP_URING_CMD is done through lists now, instead > of iterating over the entire array of queues/entries and does not > depend on the entry state anymore (a bit of the state is still left > for sanity check). 
> - Finding free ring queue entries is done through lists and not through > a bitmap anymore > - Many other code changes and bug fixes > - Performance tunings > > --- > Bernd Schubert (19): > fuse: rename to fuse_dev_end_requests and make non-static > fuse: Move fuse_get_dev to header file > fuse: Move request bits > fuse: Add fuse-io-uring design documentation > fuse: Add a uring config ioctl > Add a vmalloc_node_user function > fuse uring: Add an mmap method > fuse: Add the queue configuration ioctl > fuse: {uring} Add a dev_release exception for fuse-over-io-uring > fuse: {uring} Handle SQEs - register commands > fuse: Add support to copy from/to the ring buffer > fuse: {uring} Add uring sqe commit and fetch support > fuse: {uring} Handle uring shutdown > fuse: {uring} Allow to queue to the ring > export __wake_on_current_cpu > fuse: {uring} Wake requests on the the current cpu > fuse: {uring} Send async requests to qid of core + 1 > fuse: {uring} Set a min cpu offset io-size for reads/writes > fuse: {uring} Optimize async sends > > Documentation/filesystems/fuse-io-uring.rst | 167 ++++ > fs/fuse/Kconfig | 12 + > fs/fuse/Makefile | 1 + > fs/fuse/dev.c | 310 +++++-- > fs/fuse/dev_uring.c | 1232 +++++++++++++++++++++++++++ > fs/fuse/dev_uring_i.h | 395 +++++++++ > fs/fuse/file.c | 15 +- > fs/fuse/fuse_dev_i.h | 67 ++ > fs/fuse/fuse_i.h | 9 + > fs/fuse/inode.c | 3 + > include/linux/vmalloc.h | 1 + > include/uapi/linux/fuse.h | 135 +++ > kernel/sched/wait.c | 1 + > mm/nommu.c | 6 + > mm/vmalloc.c | 41 +- > 15 files changed, 2330 insertions(+), 65 deletions(-) > --- > base-commit: dd5a440a31fae6e459c0d6271dddd62825505361 > change-id: 20240529-fuse-uring-for-6-9-rfc2-out-f0a009005fdf > > Best regards, > -- > Bernd Schubert <bschubert@ddn.com> >
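The mechanism the cover letter describes - a server-side thread submitting an IORING_OP_URING_CMD SQE that completes once the kernel has a fuse request for it - can be illustrated with a minimal userspace sketch, loosely following the ublk pattern referred to above. Only IORING_OP_URING_CMD and the liburing calls below are real; FUSE_URING_REQ_FETCH, struct fuse_uring_cmd_req and queue_fetch_sqe() are hypothetical placeholders, not the names the patchset actually defines.

/*
 * Minimal sketch: register one ring entry with the kernel via
 * IORING_OP_URING_CMD.  The command opcode and payload layout are
 * hypothetical; the real definitions live in the patchset's
 * include/uapi/linux/fuse.h.  Needs reasonably recent io_uring
 * uapi headers for the cmd_op/cmd SQE fields.
 */
#include <liburing.h>
#include <string.h>
#include <errno.h>

struct fuse_uring_cmd_req {		/* hypothetical payload layout */
	__u64	commit_id;
	__u16	qid;
	__u16	flags;
};

#define FUSE_URING_REQ_FETCH	1	/* hypothetical command opcode */

static int queue_fetch_sqe(struct io_uring *ring, int fuse_dev_fd,
			   unsigned int qid)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct fuse_uring_cmd_req *req;

	if (!sqe)
		return -EAGAIN;

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_URING_CMD;	/* real opcode since 5.19 */
	sqe->fd = fuse_dev_fd;			/* cloned /dev/fuse fd */
	sqe->cmd_op = FUSE_URING_REQ_FETCH;

	/* the inline command payload lives in the SQE itself */
	req = (struct fuse_uring_cmd_req *)sqe->cmd;
	req->qid = qid;
	req->commit_id = 0;

	/* the CQE for this SQE arrives once a fuse request is queued */
	return io_uring_submit(ring);
}

The request payload itself would then be read from and written to the mmap'ed per-queue buffer the series sets up, which is where the shared-memory transfer described in the cover letter comes in.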
On 5/30/24 09:07, Amir Goldstein wrote: > On Wed, May 29, 2024 at 9:01 PM Bernd Schubert <bschubert@ddn.com> wrote: >> >> From: Bernd Schubert <bschubert@ddn.com> >> >> This adds support for uring communication between kernel and >> userspace daemon using opcode the IORING_OP_URING_CMD. The basic >> appraoch was taken from ublk. The patches are in RFC state, >> some major changes are still to be expected. >> >> Motivation for these patches is all to increase fuse performance. >> In fuse-over-io-uring requests avoid core switching (application >> on core X, processing of fuse server on random core Y) and use >> shared memory between kernel and userspace to transfer data. >> Similar approaches have been taken by ZUFS and FUSE2, though >> not over io-uring, but through ioctl IOs >> >> https://lwn.net/Articles/756625/ >> https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2 >> >> Avoiding cache line bouncing / numa systems was discussed >> between Amir and Miklos before and Miklos had posted >> part of the private discussion here >> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/ >> >> This cache line bouncing should be addressed by these patches >> as well. >> >> I had also noticed waitq wake-up latencies in fuse before >> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/ >> >> This spinning approach helped with performance (>40% improvement >> for file creates), but due to random server side thread/core utilization >> spinning cannot be well controlled in /dev/fuse mode. >> With fuse-over-io-uring requests are handled on the same core >> (sync requests) or on core+1 (large async requests) and performance >> improvements are achieved without spinning. >> >> Splice/zero-copy is not supported yet, Ming Lei is working >> on io-uring support for ublk_drv, but I think so far there >> is no final agreement on the approach to be taken yet. >> Fuse-over-io-uring runs significantly faster than reads/writes >> over /dev/fuse, even with splice enabled, so missing zc >> should not be a blocking issue. >> >> The patches have been tested with multiple xfstest runs in a VM >> (32 cores) with a kernel that has several debug options >> enabled (like KASAN and MSAN). >> For some tests xfstests reports that O_DIRECT is not supported, >> I need to investigate that. Interesting part is that exactly >> these tests fail in plain /dev/fuse posix mode. I had to disabled >> generic/650, which is enabling/disabling cpu cores - given ring >> threads are bound to cores issues with that are no totally >> unexpected, but then there (scheduler) kernel messages that >> core binding for these threads is removed - this needs >> to be further investigates. >> Nice effect in io-uring mode is that tests run faster (like >> generic/522 ~2400s /dev/fuse vs. ~1600s patched), though still >> slow as this is with ASAN/leak-detection/etc. >> >> The corresponding libfuse patches are on my uring branch, >> but need cleanup for submission - will happen during the next >> days. >> https://github.com/bsbernd/libfuse/tree/uring >> >> If it should make review easier, patches posted here are on >> this branch >> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.9-rfc2 >> >> TODO list for next RFC versions >> - Let the ring configure ioctl return information, like mmap/queue-buf size >> - Request kernel side address and len for a request - avoid calculation in userspace? 
>> - multiple IO sizes per queue (avoiding a calculation in userspace is probably even >> more important) >> - FUSE_INTERRUPT handling? >> - Logging (adds fields in the ioctl and also ring-request), >> any mismatch between client and server is currently very hard to understand >> through error codes >> >> Future work >> - notifications, probably on their own ring >> - zero copy >> >> I had run quite some benchmarks with linux-6.2 before LSFMMBPF2023, >> which, resulted in some tuning patches (at the end of the >> patch series). >> >> Some benchmark results >> ====================== >> >> System used for the benchmark is a 32 core (HyperThreading enabled) >> Xeon E5-2650 system. I don't have local disks attached that could do >>> 5GB/s IOs, for paged and dio results a patched version of passthrough-hp >> was used that bypasses final reads/writes. >> >> paged reads >> ----------- >> 128K IO size 1024K IO size >> jobs /dev/fuse uring gain /dev/fuse uring gain >> 1 1117 1921 1.72 1902 1942 1.02 >> 2 2502 3527 1.41 3066 3260 1.06 >> 4 5052 6125 1.21 5994 6097 1.02 >> 8 6273 10855 1.73 7101 10491 1.48 >> 16 6373 11320 1.78 7660 11419 1.49 >> 24 6111 9015 1.48 7600 9029 1.19 >> 32 5725 7968 1.39 6986 7961 1.14 >> >> dio reads (1024K) >> ----------------- >> >> jobs /dev/fuse uring gain >> 1 2023 3998 2.42 >> 2 3375 7950 2.83 >> 4 3823 15022 3.58 >> 8 7796 22591 2.77 >> 16 8520 27864 3.27 >> 24 8361 20617 2.55 >> 32 8717 12971 1.55 >> >> mmap reads (4K) >> --------------- >> (sequential, I probably should have made it random, sequential exposes >> a rather interesting/weird 'optimized' memcpy issue - sequential becomes >> reversed order 4K read) >> https://lore.kernel.org/linux-fsdevel/aae918da-833f-7ec5-ac8a-115d66d80d0e@fastmail.fm/ >> >> jobs /dev/fuse uring gain >> 1 130 323 2.49 >> 2 219 538 2.46 >> 4 503 1040 2.07 >> 8 1472 2039 1.38 >> 16 2191 3518 1.61 >> 24 2453 4561 1.86 >> 32 2178 5628 2.58 >> >> (Results on request, setting MAP_HUGETLB much improves performance >> for both, io-uring mode then has a slight advantage only.) >> >> creates/s >> ---------- >> threads /dev/fuse uring gain >> 1 3944 10121 2.57 >> 2 8580 24524 2.86 >> 4 16628 44426 2.67 >> 8 46746 56716 1.21 >> 16 79740 102966 1.29 >> 20 80284 119502 1.49 >> >> (the gain drop with >=8 cores needs to be investigated) > Hi Amir, > Hi Bernd, > > Those are impressive results! thank you! > > When approaching the FUSE uring feature from marketing POV, > I think that putting the emphasis on metadata operations is the > best approach. I can add in some more results and probably need to redo at least the metadata tests. I have all the results in google docs and in plain text files, just a bit cumbersome maybe also spam to post all of it here. > > Not the dio reads are not important (I know that is part of your use case), > but I imagine there are a lot more people out there waiting for > improvement in metadata operations overhead. I think the DIO use case is declining. My fuse work is now related to the DDN Infina project, which has a DLM - this will all go via cache and notifications (into from/to client/server) I need to start to work on that asap... I'm also not too happy yet about cached writes/reads - need to find time to investigate where the limit is. > > To me it helps to know what the current main pain points are > for people using FUSE filesystems wrt performance. 
> > Although it may not be uptodate, the most comprehensive > study about FUSE performance overhead is this FAST17 paper: > > https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf Yeah, I had seen it. Just checking again, interesting is actually their instrumentation branch https://github.com/sbu-fsl/fuse-kernel-instrumentation This should be very useful upstream, in combination with Josefs fuse tracepoints (btw, thanks for the tracepoint patch Josef! I'm going to look at it and test it tomorrow). > > In this paper, table 3 summarizes the different overheads observed > per workload. According to this table, the workloads that degrade > performance worse on an optimized passthrough fs over SSD are: > - many file creates > - many file deletes > - many small file reads > In all these workloads, it was millions of files over many directories. > The highest performance regression reported was -83% on many > small file creations. > > The moral of this long story is that it would be nice to know > what performance improvement FUSE uring can aspire to. > This is especially relevant for people that would be interested > in combining the benefits of FUSE passthrough (for data) and > FUSE uring (for metadata). As written above, I can add a few more data. But if possible I wouldn't like to concentrate on benchmarking - this can be super time consuming and doesn't help unless one investigates what is actually limiting performance. Right now we see that io-uring helps, fixing the other limits is then the next step, imho. > > What did passthrough_hp do in your patched version with creates? > Did it actually create the files? Yeah, it creates files, I think on xfs (or ext4). I had tried tmpfs first, but it had issues with seekdir/telldir until recently - will switch back to tmpfs for next tests. > In how many directories? > Maybe the directory inode lock impeded performance improvement > with >=8 threads? I don't think the directory inode lock is an issue - this should be one (or more directories) per thread Basically /usr/lib64/openmpi/bin/mpirun \ --mca btl self -n $i --oversubscribe \ ./mdtest -F -n40000 -i1 \ -d /scratch/dest -u -b2 | tee ${fname}-$i.out (mdtest is really convenient for meta operations, although requires mpi, recent versions are here (the initial LLNL project merged with ior). https://github.com/hpc/ior "-F" Perform test on files only (no directories). "-n" number_of_items Every process will creat/stat/remove # directories and files "-i" iterations The number of iterations the test will run "-u" Create a unique working directory for each task "-b" branching_factor The branching factor of the hierarchical directory structure [default: 1]. (The older LLNL repo has a better mdtest README https://github.com/LLNL/mdtest) Also, regarding metadata, I definitely need to find time resume work on atomic-open. Besides performance, there is another use case https://github.com/libfuse/libfuse/issues/945. Sweet Tea Dorminy / Josef also seem to need that. 
> >> >> Remaining TODO list for RFCv3: >> -------------------------------- >> 1) Let the ring configure ioctl return information, >> like mmap/queue-buf size >> >> Right now libfuse and kernel have lots of duplicated setup code >> and any kind of pointer/offset mismatch results in a non-working >> ring that is hard to debug - probably better when the kernel does >> the calculations and returns that to server side >> >> 2) In combination with 1, ring requests should retrieve their >> userspace address and length from kernel side instead of >> calculating it through the mmaped queue buffer on their own. >> (Introduction of FUSE_URING_BUF_ADDR_FETCH) >> >> 3) Add log buffer into the ioctl and ring-request >> >> This is to provide better error messages (instead of just >> errno) >> >> 3) Multiple IO sizes per queue >> >> Small IOs and metadata requests do not need large buffer sizes, >> we need multiple IO sizes per queue. >> >> 4) FUSE_INTERRUPT handling >> >> These are not handled yet, kernel side is probably not difficult >> anymore as ring entries take fuse requests through lists. >> >> Long term TODO: >> -------------- >> Notifications through io-uring, maybe with a separated ring, >> but I'm not sure yet. > > Is that going to improve performance in any real life workload? > I'm rather sure that we at DDN will need it for our project with the DLM. I have other priorities for now - once it comes up, adding notifications over uring shouldn't be difficult. Thanks, Bernd
On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: > From: Bernd Schubert <bschubert@ddn.com> > > This adds support for uring communication between kernel and > userspace daemon using opcode the IORING_OP_URING_CMD. The basic > appraoch was taken from ublk. The patches are in RFC state, > some major changes are still to be expected. > > Motivation for these patches is all to increase fuse performance. > In fuse-over-io-uring requests avoid core switching (application > on core X, processing of fuse server on random core Y) and use > shared memory between kernel and userspace to transfer data. > Similar approaches have been taken by ZUFS and FUSE2, though > not over io-uring, but through ioctl IOs What specifically is it about io-uring that's helpful here? Besides the ringbuffer? So the original mess was that because we didn't have a generic ringbuffer, we had aio, tracing, and god knows what else all implementing their own special purpose ringbuffers (all with weird quirks of debatable or no usefulness). It seems to me that what fuse (and a lot of other things) want is just a clean, simple, easy to use generic ringbuffer for sending what-have-you back and forth between the kernel and userspace - in this case RPCs from the kernel to userspace. But instead, the solution seems to be to just toss everything into a new giant subsystem?
On 5/30/24 17:36, Kent Overstreet wrote: > On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: >> From: Bernd Schubert <bschubert@ddn.com> >> >> This adds support for uring communication between kernel and >> userspace daemon using opcode the IORING_OP_URING_CMD. The basic >> appraoch was taken from ublk. The patches are in RFC state, >> some major changes are still to be expected. >> >> Motivation for these patches is all to increase fuse performance. >> In fuse-over-io-uring requests avoid core switching (application >> on core X, processing of fuse server on random core Y) and use >> shared memory between kernel and userspace to transfer data. >> Similar approaches have been taken by ZUFS and FUSE2, though >> not over io-uring, but through ioctl IOs > > What specifically is it about io-uring that's helpful here? Besides the > ringbuffer? > > So the original mess was that because we didn't have a generic > ringbuffer, we had aio, tracing, and god knows what else all > implementing their own special purpose ringbuffers (all with weird > quirks of debatable or no usefulness). > > It seems to me that what fuse (and a lot of other things want) is just a > clean simple easy to use generic ringbuffer for sending what-have-you > back and forth between the kernel and userspace - in this case RPCs from > the kernel to userspace. > > But instead, the solution seems to be just toss everything into a new > giant subsystem? Hmm, initially I had thought about writing my own ring buffer, but then io-uring got IORING_OP_URING_CMD, which seems to have exactly what we need? From an interface point of view, io-uring seems easy to use here, has everything we need and kind of the same thing is used for ublk - what speaks against io-uring? And what other suggestion do you have? I guess the same concern would also apply to ublk_drv. Well, decoupling from io-uring might help to get zero-copy, as there doesn't seem to be an agreement with Ming's approaches (sorry, I'm only silently following for now). From our side, a customer has pointed out security concerns for io-uring. My thinking so far was to implement the required io-uring pieces in a module and access it with ioctls... which would also allow backporting it to RHEL8/RHEL9. Thanks, Bernd
On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote: > Hmm, initially I had thought about writing my own ring buffer, but then > io-uring got IORING_OP_URING_CMD, which seems to have exactly what we > need? From interface point of view, io-uring seems easy to use here, > has everything we need and kind of the same thing is used for ublk - > what speaks against io-uring? And what other suggestion do you have? > > I guess the same concern would also apply to ublk_drv. > > Well, decoupling from io-uring might help to get for zero-copy, as there > doesn't seem to be an agreement with Mings approaches (sorry I'm only > silently following for now). > > From our side, a customer has pointed out security concerns for io-uring. > My thinking so far was to implemented the required io-uring pieces into > an module and access it with ioctls... Which would also allow to > backport it to RHEL8/RHEL9. Well, I've been starting to sketch out a ringbuffer() syscall, which would work on any (supported) file descriptor and give you a ringbuffer for reading or writing (or call it twice for both). That seems to be what fuse really wants, no? You're already using a file descriptor and your own RPC format, you just want a faster communications channel.
On 5/30/24 18:10, Kent Overstreet wrote: > On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote: >> Hmm, initially I had thought about writing my own ring buffer, but then >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we >> need? From interface point of view, io-uring seems easy to use here, >> has everything we need and kind of the same thing is used for ublk - >> what speaks against io-uring? And what other suggestion do you have? >> >> I guess the same concern would also apply to ublk_drv. >> >> Well, decoupling from io-uring might help to get for zero-copy, as there >> doesn't seem to be an agreement with Mings approaches (sorry I'm only >> silently following for now). >> >> From our side, a customer has pointed out security concerns for io-uring. >> My thinking so far was to implemented the required io-uring pieces into >> an module and access it with ioctls... Which would also allow to >> backport it to RHEL8/RHEL9. > > Well, I've been starting to sketch out a ringbuffer() syscall, which > would work on any (supported) file descriptor and give you a ringbuffer > for reading or writing (or call it twice for both). > > That seems to be what fuse really wants, no? You're already using a file > descriptor and your own RPC format, you just want a faster > communications channel. Fine with me, if you have something better/simpler with less security concerns - why not. We just need a community agreement on that. Do you have something I could look at? Thanks, Bernd
On 5/30/24 10:02 AM, Bernd Schubert wrote: > > > On 5/30/24 17:36, Kent Overstreet wrote: >> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: >>> From: Bernd Schubert <bschubert@ddn.com> >>> >>> This adds support for uring communication between kernel and >>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic >>> appraoch was taken from ublk. The patches are in RFC state, >>> some major changes are still to be expected. >>> >>> Motivation for these patches is all to increase fuse performance. >>> In fuse-over-io-uring requests avoid core switching (application >>> on core X, processing of fuse server on random core Y) and use >>> shared memory between kernel and userspace to transfer data. >>> Similar approaches have been taken by ZUFS and FUSE2, though >>> not over io-uring, but through ioctl IOs >> >> What specifically is it about io-uring that's helpful here? Besides the >> ringbuffer? >> >> So the original mess was that because we didn't have a generic >> ringbuffer, we had aio, tracing, and god knows what else all >> implementing their own special purpose ringbuffers (all with weird >> quirks of debatable or no usefulness). >> >> It seems to me that what fuse (and a lot of other things want) is just a >> clean simple easy to use generic ringbuffer for sending what-have-you >> back and forth between the kernel and userspace - in this case RPCs from >> the kernel to userspace. >> >> But instead, the solution seems to be just toss everything into a new >> giant subsystem? > > > Hmm, initially I had thought about writing my own ring buffer, but then > io-uring got IORING_OP_URING_CMD, which seems to have exactly what we > need? From interface point of view, io-uring seems easy to use here, > has everything we need and kind of the same thing is used for ublk - > what speaks against io-uring? And what other suggestion do you have? > > I guess the same concern would also apply to ublk_drv. > > Well, decoupling from io-uring might help to get for zero-copy, as there > doesn't seem to be an agreement with Mings approaches (sorry I'm only > silently following for now). If you have an interest in the zero copy, do chime in, it would certainly help get some closure on that feature. I don't think anyone disagrees it's a useful and needed feature, but there are different view points on how it's best solved. > From our side, a customer has pointed out security concerns for io-uring. That's just bs and fud these days.
On 5/30/24 18:21, Jens Axboe wrote: > On 5/30/24 10:02 AM, Bernd Schubert wrote: >> >> >> On 5/30/24 17:36, Kent Overstreet wrote: >>> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: >>>> From: Bernd Schubert <bschubert@ddn.com> >>>> >>>> This adds support for uring communication between kernel and >>>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic >>>> appraoch was taken from ublk. The patches are in RFC state, >>>> some major changes are still to be expected. >>>> >>>> Motivation for these patches is all to increase fuse performance. >>>> In fuse-over-io-uring requests avoid core switching (application >>>> on core X, processing of fuse server on random core Y) and use >>>> shared memory between kernel and userspace to transfer data. >>>> Similar approaches have been taken by ZUFS and FUSE2, though >>>> not over io-uring, but through ioctl IOs >>> >>> What specifically is it about io-uring that's helpful here? Besides the >>> ringbuffer? >>> >>> So the original mess was that because we didn't have a generic >>> ringbuffer, we had aio, tracing, and god knows what else all >>> implementing their own special purpose ringbuffers (all with weird >>> quirks of debatable or no usefulness). >>> >>> It seems to me that what fuse (and a lot of other things want) is just a >>> clean simple easy to use generic ringbuffer for sending what-have-you >>> back and forth between the kernel and userspace - in this case RPCs from >>> the kernel to userspace. >>> >>> But instead, the solution seems to be just toss everything into a new >>> giant subsystem? >> >> >> Hmm, initially I had thought about writing my own ring buffer, but then >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we >> need? From interface point of view, io-uring seems easy to use here, >> has everything we need and kind of the same thing is used for ublk - >> what speaks against io-uring? And what other suggestion do you have? >> >> I guess the same concern would also apply to ublk_drv. >> >> Well, decoupling from io-uring might help to get for zero-copy, as there >> doesn't seem to be an agreement with Mings approaches (sorry I'm only >> silently following for now). > > If you have an interest in the zero copy, do chime in, it would > certainly help get some closure on that feature. I don't think anyone > disagrees it's a useful and needed feature, but there are different view > points on how it's best solved. We had a bit of discussion with Ming about that last year; besides that, I got busy with other parts, and it became of a bit less personal interest for me as our project really needs to access the buffer (additional checksums, sending it out over a network library (libfabric), possibly even preprocessing of some data) - I think it makes sense if I work on the other fuse parts first and only come back to zero copy a bit later. > >> From our side, a customer has pointed out security concerns for io-uring. > > That's just bs and fud these days. I wasn't in contact with that customer personally, I had just seen their email. It would probably help if RHEL would eventually gain io-uring support - almost all HPC systems are using it or a clone. I was always hoping that RHEL would get it before I'm done with fuse-over-io-uring; now I'm not so sure anymore. Thanks, Bernd
On Thu, May 30, 2024 at 10:21:19AM -0600, Jens Axboe wrote: > On 5/30/24 10:02 AM, Bernd Schubert wrote: > > From our side, a customer has pointed out security concerns for io-uring. > > That's just bs and fud these days. You have a history of being less than responsive with bug reports, and this sort of attitude is not the attitude of a responsible maintainer. From what I've seen those concerns were well founded, so if you want to be taken seriously I'd be talking about what was done to address them instead of name-calling.
On 5/30/24 10:32 AM, Bernd Schubert wrote: > > > On 5/30/24 18:21, Jens Axboe wrote: >> On 5/30/24 10:02 AM, Bernd Schubert wrote: >>> >>> >>> On 5/30/24 17:36, Kent Overstreet wrote: >>>> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: >>>>> From: Bernd Schubert <bschubert@ddn.com> >>>>> >>>>> This adds support for uring communication between kernel and >>>>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic >>>>> appraoch was taken from ublk. The patches are in RFC state, >>>>> some major changes are still to be expected. >>>>> >>>>> Motivation for these patches is all to increase fuse performance. >>>>> In fuse-over-io-uring requests avoid core switching (application >>>>> on core X, processing of fuse server on random core Y) and use >>>>> shared memory between kernel and userspace to transfer data. >>>>> Similar approaches have been taken by ZUFS and FUSE2, though >>>>> not over io-uring, but through ioctl IOs >>>> >>>> What specifically is it about io-uring that's helpful here? Besides the >>>> ringbuffer? >>>> >>>> So the original mess was that because we didn't have a generic >>>> ringbuffer, we had aio, tracing, and god knows what else all >>>> implementing their own special purpose ringbuffers (all with weird >>>> quirks of debatable or no usefulness). >>>> >>>> It seems to me that what fuse (and a lot of other things want) is just a >>>> clean simple easy to use generic ringbuffer for sending what-have-you >>>> back and forth between the kernel and userspace - in this case RPCs from >>>> the kernel to userspace. >>>> >>>> But instead, the solution seems to be just toss everything into a new >>>> giant subsystem? >>> >>> >>> Hmm, initially I had thought about writing my own ring buffer, but then >>> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we >>> need? From interface point of view, io-uring seems easy to use here, >>> has everything we need and kind of the same thing is used for ublk - >>> what speaks against io-uring? And what other suggestion do you have? >>> >>> I guess the same concern would also apply to ublk_drv. >>> >>> Well, decoupling from io-uring might help to get for zero-copy, as there >>> doesn't seem to be an agreement with Mings approaches (sorry I'm only >>> silently following for now). >> >> If you have an interest in the zero copy, do chime in, it would >> certainly help get some closure on that feature. I don't think anyone >> disagrees it's a useful and needed feature, but there are different view >> points on how it's best solved. > > We had a bit of discussion with Ming about that last year, besides that > I got busy with other parts, it got a bit less of personal interest for > me as our project really needs to access the buffer (additional > checksums, sending it out over network library (libfabric), possibly > even preprocessing of some data) - I think it makes sense if I work on > the other fuse parts first and only come back zero copy a bit later. Ah I see - yes if you're going to be touching the data anyway, zero copy is less of a concern. Some memory bandwidth can still be saved if you're not touching all of it, of course. But if you are, you're probably better off copying it in the first place. >>> From our side, a customer has pointed out security concerns for io-uring. >> >> That's just bs and fud these days. 
> > I wasn't in contact with that customer personally, I had just seen their > email.It would probably help if RHEL would eventually gain io-uring > support - almost all of HPC systems are using it or a clone. I was > always hoping that RHEL would get it before I'm done with > fuse-over-io-uring, now I'm not so sure anymore. Not sure what the RHEL status is. I know backports are done on the io_uring side, but not sure what base they are currently on. I strongly suspect that would be a gating factor for getting it enabled. If it's too out of date, then performance isn't going to be as good as current mainline anyway.
On 5/30/24 11:16 AM, Kent Overstreet wrote: > On Thu, May 30, 2024 at 10:21:19AM -0600, Jens Axboe wrote: >> On 5/30/24 10:02 AM, Bernd Schubert wrote: >>> From our side, a customer has pointed out security concerns for io-uring. >> >> That's just bs and fud these days. > > You have a history of being less than responsive with bug reports, and > this sort of attitude is not the attitude of a responsible maintainer. Ok... That's a bold claim. We actually tend to bug reports quickly and get them resolved in a timely manner. Maybe I've been less responsive on a bug report from you, but that's usually because the emails turn out like this one, with odd and unwarranted claims. Not taking the bait. If you're referring to the file reference and umount issue, yes I do very much want to get that one resolved. I do have patches for that, but was never quite happy with them. As it isn't a stability or safety concern, and not a practical concern outside of the test case in question, it hasn't been super high on the radar unfortunately. > From what I've seen those concerns were well founded, so if you want to > be taking seriously I'd be talking about what was done to address them > instead of namecalling. I have addressed it several times in the past. tldr is that yeah the initial history of io_uring wasn't great, due to some unfortunate initial design choices (mostly around async worker setup and identities). Those have since been rectified, and the code base is stable and solid these days.
On Thu, May 30, 2024 at 06:17:29PM +0200, Bernd Schubert wrote: > > > On 5/30/24 18:10, Kent Overstreet wrote: > > On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote: > >> Hmm, initially I had thought about writing my own ring buffer, but then > >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we > >> need? From interface point of view, io-uring seems easy to use here, > >> has everything we need and kind of the same thing is used for ublk - > >> what speaks against io-uring? And what other suggestion do you have? > >> > >> I guess the same concern would also apply to ublk_drv. > >> > >> Well, decoupling from io-uring might help to get for zero-copy, as there > >> doesn't seem to be an agreement with Mings approaches (sorry I'm only > >> silently following for now). > >> > >> From our side, a customer has pointed out security concerns for io-uring. > >> My thinking so far was to implemented the required io-uring pieces into > >> an module and access it with ioctls... Which would also allow to > >> backport it to RHEL8/RHEL9. > > > > Well, I've been starting to sketch out a ringbuffer() syscall, which > > would work on any (supported) file descriptor and give you a ringbuffer > > for reading or writing (or call it twice for both). > > > > That seems to be what fuse really wants, no? You're already using a file > > descriptor and your own RPC format, you just want a faster > > communications channel. > > Fine with me, if you have something better/simpler with less security > concerns - why not. We just need a community agreement on that. > > Do you have something I could look at? Like I said it's at the early sketch stage, I haven't written any code yet. But I'm envisioning something very simple - just a syscall that gives you a mapped buffer of a specified size with head and tail pointers. But this has been kicking around for awhile, so if you're interested I could probably have something for you to try out in the next few days.
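To make the idea concrete, a purely illustrative sketch of such a mapped ring - a shared header with head/tail indices in front of a power-of-two data area, with the consumer popping fixed-size records - could look like the following. None of this corresponds to an existing kernel interface; struct ringbuffer_desc, the record handling and the proposed syscall itself are all hypothetical.

/*
 * Illustrative layout for a mapped head/tail ring.  Assumes the ring
 * size is a power of two and a multiple of the record size, so a
 * record never wraps around the end of the data area.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

struct ringbuffer_desc {
	_Atomic uint32_t head;		/* advanced by the consumer */
	_Atomic uint32_t tail;		/* advanced by the producer */
	uint32_t	 mask;		/* ring size - 1 */
	uint32_t	 pad;
	uint8_t		 data[];	/* record bytes follow */
};

/* Consumer side: copy out one fixed-size record if one is available. */
static int ring_read(struct ringbuffer_desc *rb, void *rec, uint32_t rec_size)
{
	uint32_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);

	if (tail - head < rec_size)
		return -1;		/* ring is empty */

	memcpy(rec, &rb->data[head & rb->mask], rec_size);
	atomic_store_explicit(&rb->head, head + rec_size,
			      memory_order_release);
	return 0;
}

Whether such a ring would carry raw bytes, fixed-size records or serialized requests is exactly the design question being debated in this thread.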
On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote: > I have addressed it several times in the past. tldr is that yeah the > initial history of io_uring wasn't great, due to some unfortunate > initial design choices (mostly around async worker setup and > identities). Not to pick on you too much but the initial history looked pretty messy to me - a lot of layering violations - it made aio.c look clean. I know you were in "get shit done" mode, but at some point we have to take a step back and ask "what are the different core concepts being expressed here, and can we start picking them apart?". A generic ringbuffer would be a good place to start. I'd also really like to see some more standardized mechanisms for "I'm a kernel thread doing work on behalf of some other user thread" - this comes up elsewhere, I'm talking with David Howells right now about fsconfig which is another place it is or will be coming up. > Those have since been rectified, and the code base is > stable and solid these days. good tests, code coverage analysis to verify, good syzbot coverage?
On 5/30/24 11:58 AM, Kent Overstreet wrote: > On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote: >> I have addressed it several times in the past. tldr is that yeah the >> initial history of io_uring wasn't great, due to some unfortunate >> initial design choices (mostly around async worker setup and >> identities). > > Not to pick on you too much but the initial history looked pretty messy > to me - a lot of layering violations - it made aio.c look clean. Oh I certainly agree, the initial code was in a much worse state than it is in now. Lots of things have happened there, like splitting things up and adding appropriate layering. That was more of a code hygiene kind of thing, to make it easier to understand, maintain, and develop. Any new subsystem is going to see lots of initial churn, regardless of how long it's been developed before going into upstream. We certainly had lots of churn, where these days it's stabilized. I don't think that's unusual, particularly for something that attempts to do certain things very differently. I would've loved to start with our current state, but I don't think years of being out of tree would've completely solved that. Some things you just don't find until it's in tree, unfortunately. > I know you were in "get shit done" mode, but at some point we have to > take a step back and ask "what are the different core concepts being > expressed here, and can we start picking them apart?". A generic > ringbuffer would be a good place to start. > > I'd also really like to see some more standardized mechanisms for "I'm a > kernel thread doing work on behalf of some other user thread" - this > comes up elsewhere, I'm talking with David Howells right now about > fsconfig which is another place it is or will be coming up. That does exist, and it came from the io_uring side of needing exactly that. This is why we have create_io_thread(). IMHO it's the only sane way to do it, trying to guesstimate what happens deep down in a random callstack, and setting things up appropriately, is impossible. This is where most of the earlier day io_uring issues came from, and what I referred to as a "poor initial design choice". >> Those have since been rectified, and the code base is >> stable and solid these days. > > good tests, code coverage analysis to verify, good syzbot coverage? 3x yes. Obviously I'm always going to say that tests could be better, have better coverage, cover more things, because nothing is perfect (and if you think it is, you're fooling yourself) and as a maintainer I want perfect coverage. But we're pretty diligent these days about adding tests for everything. And any regression or bug report always gets test cases written.
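For reference, the create_io_thread() pattern mentioned above looks roughly like the sketch below; the real user is io_uring's io-wq worker pool (io_uring/io-wq.c). Because the worker is cloned directly from the task that spawns it, it inherits that task's mm, files and creds rather than guessing at context later. my_worker_fn, struct my_worker_data and the loop body are simplified placeholders; create_io_thread() and wake_up_new_task() are the actual kernel helpers.

/*
 * Rough sketch of spawning an io-worker style thread from the
 * submitting task.  Error handling and work-item plumbing trimmed.
 */
#include <linux/sched.h>
#include <linux/sched/task.h>
#include <linux/err.h>

struct my_worker_data {
	bool exit;			/* placeholder shutdown flag */
};

static int my_worker_fn(void *data)
{
	struct my_worker_data *wd = data;

	/*
	 * Runs as a PF_IO_WORKER thread cloned from the creating task,
	 * so mm, files, creds etc. are the submitter's.
	 */
	while (!READ_ONCE(wd->exit)) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (!READ_ONCE(wd->exit))
			schedule();	/* real code would pull work items here */
		__set_current_state(TASK_RUNNING);
	}
	do_exit(0);
}

static int my_spawn_worker(struct my_worker_data *wd, int numa_node)
{
	struct task_struct *tsk;

	tsk = create_io_thread(my_worker_fn, wd, numa_node);
	if (IS_ERR(tsk))
		return PTR_ERR(tsk);

	wake_up_new_task(tsk);		/* thread starts running my_worker_fn */
	return 0;
}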
On Thu, May 30, 2024 at 06:17:29PM +0200, Bernd Schubert wrote: > > > On 5/30/24 18:10, Kent Overstreet wrote: > > On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote: > >> Hmm, initially I had thought about writing my own ring buffer, but then > >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we > >> need? From interface point of view, io-uring seems easy to use here, > >> has everything we need and kind of the same thing is used for ublk - > >> what speaks against io-uring? And what other suggestion do you have? > >> > >> I guess the same concern would also apply to ublk_drv. > >> > >> Well, decoupling from io-uring might help to get for zero-copy, as there > >> doesn't seem to be an agreement with Mings approaches (sorry I'm only > >> silently following for now). > >> > >> From our side, a customer has pointed out security concerns for io-uring. > >> My thinking so far was to implemented the required io-uring pieces into > >> an module and access it with ioctls... Which would also allow to > >> backport it to RHEL8/RHEL9. > > > > Well, I've been starting to sketch out a ringbuffer() syscall, which > > would work on any (supported) file descriptor and give you a ringbuffer > > for reading or writing (or call it twice for both). > > > > That seems to be what fuse really wants, no? You're already using a file > > descriptor and your own RPC format, you just want a faster > > communications channel. > > Fine with me, if you have something better/simpler with less security > concerns - why not. We just need a community agreement on that. > > Do you have something I could look at? FWIW I have no strong feelings between using iouring vs any other ringbuffer mechanism we come up with in the future. That being said iouring is here now, is proven to work, and these are good performance improvements. If in the future something else comes along that gives us better performance then absolutely we should explore adding that functionality. But this solves the problem today, and I need the problem solved yesterday, so continuing with this patchset is very much a worthwhile investment, one that I'm very happy you're tackling Bernd instead of me ;). Thanks, Josef
On Thu, May 30, 2024 at 12:48:56PM -0600, Jens Axboe wrote: > On 5/30/24 11:58 AM, Kent Overstreet wrote: > > On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote: > >> I have addressed it several times in the past. tldr is that yeah the > >> initial history of io_uring wasn't great, due to some unfortunate > >> initial design choices (mostly around async worker setup and > >> identities). > > > > Not to pick on you too much but the initial history looked pretty messy > > to me - a lot of layering violations - it made aio.c look clean. > > Oh I certainly agree, the initial code was in a much worse state than it > is in now. Lots of things have happened there, like splitting things up > and adding appropriate layering. That was more of a code hygiene kind of > thing, to make it easier to understand, maintain, and develop. > > Any new subsystem is going to see lots of initial churn, regardless of > how long it's been developed before going into upstream. We certainly > had lots of churn, where these days it's stabilized. I don't think > that's unusual, particularly for something that attempts to do certain > things very differently. I would've loved to start with our current > state, but I don't think years of being out of tree would've completely > solved that. Some things you just don't find until it's in tree, > unfortunately. Well, the main thing I would've liked is a bit more discussion in the early days of io_uring; there are things we could've done differently back then that could've got us something cleaner in the long run. My main complaints were always - yet another special purpose ringbuffer, and - yet another parallel syscall interface. We've got too many of those too (aio is another), and the API fragmentation is a real problem for userspace that just wants to be able to issue arbitrary syscalls asynchronously. IO uring could've just been serializing syscall numbers and arguments - that would have worked fine. Given the history of failed AIO replacements just wanting to shove in something working was understandable, but I don't think those would have been big asks. > > I'd also really like to see some more standardized mechanisms for "I'm a > > kernel thread doing work on behalf of some other user thread" - this > > comes up elsewhere, I'm talking with David Howells right now about > > fsconfig which is another place it is or will be coming up. > > That does exist, and it came from the io_uring side of needing exactly > that. This is why we have create_io_thread(). IMHO it's the only sane > way to do it, trying to guesstimate what happens deep down in a random > callstack, and setting things up appropriately, is impossible. This is > where most of the earlier day io_uring issues came from, and what I > referred to as a "poor initial design choice". Thanks, I wasn't aware of that - that's worth highlighting. I may switch thread_with_file to that, and the fsconfig work David and I are talking about can probably use it as well. We really should have something lighter weight that we can use for work items though, that's our standard mechanism for deferred work, not spinning up a whole kthread. We do have kthread_use_mm() - there's no reason we couldn't do an expanded version of that for all the other shared resources that need to be available. 
This was also another blocker in the other aborted AIO replacements, so reusing clone flags does seem like the most reasonable way to make progress here, but I would wager there's other stuff in task_struct that really should be shared and isn't. task_struct is up to 825 (!) lines now, which means good luck on even finding that stuff - maybe at some point we'll have to get a giant effort going to clean up and organize task_struct, like willy's been doing with struct page. > >> Those have since been rectified, and the code base is > >> stable and solid these days. > > > > good tests, code coverage analysis to verify, good syzbot coverage? > > 3x yes. Obviously I'm always going to say that tests could be better, > have better coverage, cover more things, because nothing is perfect (and > if you think it is, you're fooling yourself) and as a maintainer I want > perfect coverage. But we're pretty diligent these days about adding > tests for everything. And any regression or bug report always gets test > cases written. *nod* that's encouraging. Looking forward to the umount issue being fixed so I can re-enable it in my tests...
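As a reference point for the kthread_use_mm() mechanism brought up above, a minimal sketch of a kernel worker temporarily adopting a user task's mm might look like this. kthread_use_mm(), kthread_unuse_mm(), mmget_not_zero() and mmput() are the real helpers and the caller must itself be a kernel thread; where the mm and the user pointer come from, and the function name, are hypothetical.

/*
 * Minimal sketch: borrow a user task's mm so copy_to_user() resolves
 * against that task's address space, then drop it again.
 */
#include <linux/kthread.h>
#include <linux/sched/mm.h>
#include <linux/uaccess.h>
#include <linux/errno.h>

static int copy_result_to_task(struct mm_struct *mm,
			       void __user *dst, const void *src, size_t len)
{
	int ret = 0;

	if (!mmget_not_zero(mm))	/* keep the mm alive while we use it */
		return -ESRCH;

	kthread_use_mm(mm);		/* current now operates on that mm */
	if (copy_to_user(dst, src, len))
		ret = -EFAULT;
	kthread_unuse_mm(mm);

	mmput(mm);
	return ret;
}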
On Thu, May 30, 2024 at 03:09:41PM -0400, Josef Bacik wrote: > On Thu, May 30, 2024 at 06:17:29PM +0200, Bernd Schubert wrote: > > > > > > On 5/30/24 18:10, Kent Overstreet wrote: > > > On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote: > > >> Hmm, initially I had thought about writing my own ring buffer, but then > > >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we > > >> need? From interface point of view, io-uring seems easy to use here, > > >> has everything we need and kind of the same thing is used for ublk - > > >> what speaks against io-uring? And what other suggestion do you have? > > >> > > >> I guess the same concern would also apply to ublk_drv. > > >> > > >> Well, decoupling from io-uring might help to get for zero-copy, as there > > >> doesn't seem to be an agreement with Mings approaches (sorry I'm only > > >> silently following for now). > > >> > > >> From our side, a customer has pointed out security concerns for io-uring. > > >> My thinking so far was to implemented the required io-uring pieces into > > >> an module and access it with ioctls... Which would also allow to > > >> backport it to RHEL8/RHEL9. > > > > > > Well, I've been starting to sketch out a ringbuffer() syscall, which > > > would work on any (supported) file descriptor and give you a ringbuffer > > > for reading or writing (or call it twice for both). > > > > > > That seems to be what fuse really wants, no? You're already using a file > > > descriptor and your own RPC format, you just want a faster > > > communications channel. > > > > Fine with me, if you have something better/simpler with less security > > concerns - why not. We just need a community agreement on that. > > > > Do you have something I could look at? > > FWIW I have no strong feelings between using iouring vs any other ringbuffer > mechanism we come up with in the future. > > That being said iouring is here now, is proven to work, and these are good > performance improvements. If in the future something else comes along that > gives us better performance then absolutely we should explore adding that > functionality. But this solves the problem today, and I need the problem solved > yesterday, so continuing with this patchset is very much a worthwhile > investment, one that I'm very happy you're tackling Bernd instead of me ;). > Thanks, I suspect a ringbuffer syscall will actually be simpler than switching to io_uring. Let me see if I can cook something up quickly - there's no rocket science here and this is all stuff we've done before, so it shouldn't take too long (famous last words...)
On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: > From: Bernd Schubert <bschubert@ddn.com> > > This adds support for uring communication between kernel and > userspace daemon using opcode the IORING_OP_URING_CMD. The basic > appraoch was taken from ublk. The patches are in RFC state, > some major changes are still to be expected. > First, thanks for tackling this, this is a pretty big project and pretty important, you've put a lot of work into this and it's pretty great. A few things that I've pointed out elsewhere, but bear repeating and keeping in mind for the entire patch series. 1. Make sure you've got changelogs. There's several patches that just don't have changelogs. I get things where it's like "add a mmap interface", but it would be good to have an explanation as to why you're adding it and what we hope to get out of that change. 2. Again as I stated elsewhere, you add a bunch of structs and stuff that aren't related to the current patch, which makes it difficult for me to work out what's needed or how it's used, so I go back and forth between the code and the patch a lot, and I've probably missed a few things. 3. Drop the CPU scheduling changes for this first pass. The performance optimizations are definitely worth pursuing, but I don't want to get hung up in waiting on the scheduler dependencies to land. Additionally what looks like it works in your setup may not necessarily be good for everybody's setup. Having the baseline stuff in and working well, and then providing patches to change the CPU stuff in a way that we can test in a variety of different environments to validate the wins would be better. As someone who has to regularly go figure out what in the scheduler changed to make all of our metrics look bad again, I'm very wary of changes that make CPU selection policy decisions in a way that aren't changeable without changing the code. Thanks, Josef
On 5/30/24 1:35 PM, Kent Overstreet wrote: > On Thu, May 30, 2024 at 12:48:56PM -0600, Jens Axboe wrote: >> On 5/30/24 11:58 AM, Kent Overstreet wrote: >>> On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote: >>>> I have addressed it several times in the past. tldr is that yeah the >>>> initial history of io_uring wasn't great, due to some unfortunate >>>> initial design choices (mostly around async worker setup and >>>> identities). >>> >>> Not to pick on you too much but the initial history looked pretty messy >>> to me - a lot of layering violations - it made aio.c look clean. >> >> Oh I certainly agree, the initial code was in a much worse state than it >> is in now. Lots of things have happened there, like splitting things up >> and adding appropriate layering. That was more of a code hygiene kind of >> thing, to make it easier to understand, maintain, and develop. >> >> Any new subsystem is going to see lots of initial churn, regardless of >> how long it's been developed before going into upstream. We certainly >> had lots of churn, where these days it's stabilized. I don't think >> that's unusual, particularly for something that attempts to do certain >> things very differently. I would've loved to start with our current >> state, but I don't think years of being out of tree would've completely >> solved that. Some things you just don't find until it's in tree, >> unfortunately. > > Well, the main thing I would've liked is a bit more discussion in the > early days of io_uring; there are things we could've done differently > back then that could've got us something cleaner in the long run. No matter how much discussion would've been had, there always would've been compromises or realizations that "yeah that thing there should've been done differently". In any case, pointless to argue about that, as the only thing we can change is how things look in the present and future. At least I don't have a time machine. > My main complaints were always > - yet another special purpose ringbuffer, and > - yet another parallel syscall interface. Exactly how many "parallel syscall interfaces" do we have? > We've got too many of those too (aio is another), and the API Like which ones? aio is a special case async interface for O_DIRECT IO, that's about it. It's not a generic IO interface. It's literally dio only. And yes, then it has the option of syncing a file, and poll got added some years ago as well. But for the longest duration of aio, it was just dio aio. The early versions of io_uring actually added on top of that, but I didn't feel like it belonged there. > fragmentation is a real problem for userspace that just wants to be able > to issue arbitrary syscalls asynchronously. IO uring could've just been > serializing syscall numbers and arguments - that would have worked fine. That doesn't work at all. If all syscalls had been designed with issue + wait semantics, then yeah that would obviously be the way that it would've been done. You just add all of them, and pass arguments, done. But that's not reality. You can do that if you just offload to a worker thread, but that's not how you get performance. And you could very much STILL do just that, in fact it'd be trivial to wire up. But nobody would use it, because something that just always punts to a thread would be awful for performance. You may as well just do that offload in userspace then. Hence the majority of the work for wiring up operations that implement existing functionality in an async way is core work. 
The io_uring interface for it is the simplest part, once you have the underpinnings doing what you want. Sometimes that's some ugly "this can't block, if it does, return -EAGAIN", and sometimes it's refactoring things a bit so that you can tap into the inevitable waitqueue. There's no single recipe, it all depends on how it currently works. > Given the history of failed AIO replacements just wanting to shove in > something working was understandable, but I don't think those would have > been big asks. What are these failed AIO replacements? aio is for storage, io_uring was never meant to be a storage only interface. The only other attempt I can recall, back in the day, was the acall and threadlet stuff that Ingo and zab worked on. And even that attempted to support async in a performant way, by doing work inline whenever possible. But hard to use, as you'd return as a different pid if the original task blocked. >>> I'd also really like to see some more standardized mechanisms for "I'm a >>> kernel thread doing work on behalf of some other user thread" - this >>> comes up elsewhere, I'm talking with David Howells right now about >>> fsconfig which is another place it is or will be coming up. >> >> That does exist, and it came from the io_uring side of needing exactly >> that. This is why we have create_io_thread(). IMHO it's the only sane >> way to do it, trying to guesstimate what happens deep down in a random >> callstack, and setting things up appropriately, is impossible. This is >> where most of the earlier day io_uring issues came from, and what I >> referred to as a "poor initial design choice". > > Thanks, I wasn't aware of that - that's worth highlighting. I may switch > thread_with_file to that, and the fsconfig work David and I are talking > about can probably use it as well. > > We really should have something lighter weight that we can use for work > items though, that's our standard mechanism for deferred work, not > spinning up a whole kthread. We do have kthread_use_mm() - there's no > reason we couldn't do an expanded version of that for all the other > shared resources that need to be available. Like io-wq does for io_uring? That's why it's there. io_uring tries not to rely on it very much, it's considered the slow path for the above mentioned reasons of why thread offload generally isn't a great idea. But at least it doesn't require a full fork for each item. > This was also another blocker in the other aborted AIO replacements, so > reusing clone flags does seem like the most reasonable way to make > progress here, but I would wager there's other stuff in task_struct that > really should be shared and isn't. task_struct is up to 825 (!) lines > now, which means good luck on even finding that stuff - maybe at some > point we'll have to get a giant effort going to clean up and organize > task_struct, like willy's been doing with struct page. Well that thing is an unwieldy beast and has been for many years. So yeah, very much agree that it needs some tender love and care, and we'd be better off for it. >>>> Those have since been rectified, and the code base is >>>> stable and solid these days. >>> >>> good tests, code coverage analysis to verify, good syzbot coverage? >> >> 3x yes. Obviously I'm always going to say that tests could be better, >> have better coverage, cover more things, because nothing is perfect (and >> if you think it is, you're fooling yourself) and as a maintainer I want >> perfect coverage. 
But we're pretty diligent these days about adding >> tests for everything. And any regression or bug report always gets test >> cases written. > > *nod* that's encouraging. Looking forward to the umount issue being > fixed so I can re-enable it in my tests... I'll pick it up again soon enough, I'll let you know.
On Fri, May 31, 2024 at 12:21 AM Jens Axboe <axboe@kernel.dk> wrote: > > On 5/30/24 10:02 AM, Bernd Schubert wrote: > > > > > > On 5/30/24 17:36, Kent Overstreet wrote: > >> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: > >>> From: Bernd Schubert <bschubert@ddn.com> > >>> > >>> This adds support for uring communication between kernel and > >>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic > >>> appraoch was taken from ublk. The patches are in RFC state, > >>> some major changes are still to be expected. > >>> > >>> Motivation for these patches is all to increase fuse performance. > >>> In fuse-over-io-uring requests avoid core switching (application > >>> on core X, processing of fuse server on random core Y) and use > >>> shared memory between kernel and userspace to transfer data. > >>> Similar approaches have been taken by ZUFS and FUSE2, though > >>> not over io-uring, but through ioctl IOs > >> > >> What specifically is it about io-uring that's helpful here? Besides the > >> ringbuffer? > >> > >> So the original mess was that because we didn't have a generic > >> ringbuffer, we had aio, tracing, and god knows what else all > >> implementing their own special purpose ringbuffers (all with weird > >> quirks of debatable or no usefulness). > >> > >> It seems to me that what fuse (and a lot of other things want) is just a > >> clean simple easy to use generic ringbuffer for sending what-have-you > >> back and forth between the kernel and userspace - in this case RPCs from > >> the kernel to userspace. > >> > >> But instead, the solution seems to be just toss everything into a new > >> giant subsystem? > > > > > > Hmm, initially I had thought about writing my own ring buffer, but then > > io-uring got IORING_OP_URING_CMD, which seems to have exactly what we > > need? From interface point of view, io-uring seems easy to use here, > > has everything we need and kind of the same thing is used for ublk - > > what speaks against io-uring? And what other suggestion do you have? > > > > I guess the same concern would also apply to ublk_drv. > > > > Well, decoupling from io-uring might help to get for zero-copy, as there > > doesn't seem to be an agreement with Mings approaches (sorry I'm only > > silently following for now). We have concluded pipe & splice isn't good for zero copy, and io_uring provides zc in async way, which is really nice for async application. > > If you have an interest in the zero copy, do chime in, it would > certainly help get some closure on that feature. I don't think anyone > disagrees it's a useful and needed feature, but there are different view > points on how it's best solved. Now generic sqe group feature is being added, and generic zero copy can be built over it easily, can you or anyone take a look? https://lore.kernel.org/linux-block/20240511001214.173711-1-ming.lei@redhat.com/ Thanks, Ming
On Wed, 29 May 2024 at 20:01, Bernd Schubert <bschubert@ddn.com> wrote: > > From: Bernd Schubert <bschubert@ddn.com> > > This adds support for uring communication between kernel and > userspace daemon using opcode the IORING_OP_URING_CMD. The basic > appraoch was taken from ublk. The patches are in RFC state, > some major changes are still to be expected. Thank you very much for tackling this. I think this is an important feature and one that could potentially have a significant effect on fuse performance, which is something many people would love to see. I'm thinking about the architecture and there are some questions: Have you tried just plain IORING_OP_READV / IORING_OP_WRITEV? That's would just be the async part, without the mapped buffer. I suspect most of the performance advantage comes from this and the per-CPU queue, not from the mapped buffer, yet most of the complexity seems to be related to the mapped buffer. Maybe there's an advantage in using an atomic op for WRITEV + READV, but I'm not quite seeing it yet, since there's no syscall overhead for separate ops. What's the reason for separate async and sync request queues? > Avoiding cache line bouncing / numa systems was discussed > between Amir and Miklos before and Miklos had posted > part of the private discussion here > https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/ > > This cache line bouncing should be addressed by these patches > as well. Why do you think this is solved? > I had also noticed waitq wake-up latencies in fuse before > https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/ > > This spinning approach helped with performance (>40% improvement > for file creates), but due to random server side thread/core utilization > spinning cannot be well controlled in /dev/fuse mode. > With fuse-over-io-uring requests are handled on the same core > (sync requests) or on core+1 (large async requests) and performance > improvements are achieved without spinning. I feel this should be a scheduler decision, but the selecting the queue needs to be based on that decision. Maybe the scheduler people can help out with this. Thanks, Miklos
On 6/11/24 10:20, Miklos Szeredi wrote: > On Wed, 29 May 2024 at 20:01, Bernd Schubert <bschubert@ddn.com> wrote: >> >> From: Bernd Schubert <bschubert@ddn.com> >> >> This adds support for uring communication between kernel and >> userspace daemon using opcode the IORING_OP_URING_CMD. The basic >> appraoch was taken from ublk. The patches are in RFC state, >> some major changes are still to be expected. > > Thank you very much for tackling this. I think this is an important > feature and one that could potentially have a significant effect on > fuse performance, which is something many people would love to see. > > I'm thinking about the architecture and there are some questions: > > Have you tried just plain IORING_OP_READV / IORING_OP_WRITEV? That's > would just be the async part, without the mapped buffer. I suspect > most of the performance advantage comes from this and the per-CPU > queue, not from the mapped buffer, yet most of the complexity seems to > be related to the mapped buffer. I didn't try because IORING_OP_URING_CMD seems to be exactly made for our case. Firstly and although I didn't have time to look into it yet, but with the current approach it should be rather simple to switch to another ring as Kent has suggested. Secondly, with IORING_OP_URING_CMD we already have only a single command to submit requests and fetch the next one - half of the system calls. Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach? https://github.com/uroni/fuseuring? I.e. it hook into the existing fuse and just changes from read()/write() of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero control which ring/queue and which ring-entry handles the request. Thirdly, initially I had even removed the allocation of 'struct fuse_req' and directly allocated these on available ring entries. I.e. the application thread was writing to the mmap ring buffer. I just removed that code for now, as it introduced additional complexity with an unknown performance outcome. If it should be helpful we could add that later. I don't think we have that flexibility with IORING_OP_READV/IORING_OP_WRITEV. And then, personally I do not see mmap complexity - it is just very convenient to write/read to/from the ring buffer from any kernel thread. Currently not supported by io-uring, but I think we could even avoid any kind of system call and let the application poll for results. Similar to what IORING_SETUP_SQPOLL does, but without the requirement of another kernel thread. Most complexity and issues I got, come from the requirement of io_uring to complete requests with io_uring_op_cmd_done. In RFCv1 you had annotated the async shutdown method - that was indeed really painful and resulted in a never ending number of shutdown races. Once I removed that and also removed using a bitmap (I don't even know why I used a bitmap in the first place in RFCv1 and not lists as in RFCv2) shutdown became manageable. If there would be way to tell io-uring that kernel fuse is done and to let it complete itself whatever was not completed yet, it would would be great help. Although from my point of view, that could be done once this series is merged. Or we decide to switch to Kents new ring buffer and might not have that problem at all... > > Maybe there's an advantage in using an atomic op for WRITEV + READV, > but I'm not quite seeing it yet, since there's no syscall overhead for > separate ops. Here I get confused, could please explain? 
Current fuse has a read + write operation - a read() system call to process a fuse request and a write() call to submit the result and then read() to fetch the next request. If userspace has to send IORING_OP_READV to fetch a request and complete with IORING_OP_WRITEV, it would go through the existing code path with those operations? Well, maybe userspace could submit with IOSQE_IO_LINK, but that sounds like it would need to send two ring entries? I.e. memory and processing overhead? And then, no way to further optimize and do fuse_req allocation on the ring (if there are free entries). And probably also no way that we ever let the application work in the SQPOLL way, because the application thread does not have the right to read from the fuse-server buffer? I.e. what I mean is that IORING_OP_URING_CMD gives better flexibility. Btw, another issue that is avoided with the new ring-request layout is compatibility and alignment. The fuse header is always in a 4K section and the request data follows after it. I.e. extending request sizes does not impose compatibility issues any more and also for passthrough and similar - O_DIRECT can be passed through to the backend file system. Although these issues probably need to be solved in the current fuse protocol as well. > > What's the reason for separate async and sync request queues? To have credits for IO operations. For an overlay file system it might not matter, because it might get stuck with another system call in the underlying file system. But for something that is not system call bound and that has control, async and sync can be handled with priority given by userspace. As I had written in the introduction mail, I'm currently working on different IO sizes per ring queue - it gets even more fine grained and with the advantage of reduced memory usage per queue when the queue has entries for many small requests and a few large ones. Next step here would be to add credits for reads/writes (or to reserve credits for meta operations) in the sync queue, so that meta operations can always go through. If there should be async meta operations (through application io-uring requests?), the same would need to be done for the async queue as well. Last but not least, with separation there is no global async queue anymore - no global lock and cache issues. > >> Avoiding cache line bouncing / numa systems was discussed >> between Amir and Miklos before and Miklos had posted >> part of the private discussion here >> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/ >> >> This cache line bouncing should be addressed by these patches >> as well. > > Why do you think this is solved? I _guess_ that writing to the mmaped buffer and processing requests on the same cpu core should make it possible to keep things in cpu cache. I did not verify that with perf, though. > >> I had also noticed waitq wake-up latencies in fuse before >> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/ >> >> This spinning approach helped with performance (>40% improvement >> for file creates), but due to random server side thread/core utilization >> spinning cannot be well controlled in /dev/fuse mode. >> With fuse-over-io-uring requests are handled on the same core >> (sync requests) or on core+1 (large async requests) and performance >> improvements are achieved without spinning. > > I feel this should be a scheduler decision, but the selecting the > queue needs to be based on that decision.
> Maybe the scheduler people > can help out with this. For sync requests, getting the scheduler involved is what is responsible for making fuse really slow. It schedules on random cores that are in sleep states, and additionally frequency scaling does not go up. We really need to stay on the same core here, as that core is submitting the result, is already running (i.e. not sleeping) and has data in its cache. All benchmark results with sync requests point that out. For async requests, the above still applies, but basically one thread is writing/reading and the other thread handles/provides the data. Random switching of cores is then still not good. At best, and to be tested, one could submit rather large chunks to other cores. What is indeed to be discussed (and I think annotated in the corresponding patch description) is whether there is a way to figure out if the other core is already busy. But then the scheduler does not know what it is busy with - are these existing fuse requests or something else. That part is really hard and I don't think it makes sense to discuss this right now before the main part is merged. IMHO, better to add a config flag for the current cpu+1 scheduling with an annotation that this setting might go away in the future. Thanks, Bernd
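[For illustration, a rough C sketch of the per-entry buffer layout described in the mail above: the fuse header sits in its own 4K section and page-aligned payload follows, which is what makes O_DIRECT passthrough and future header extensions unproblematic. All names and sizes are assumptions for illustration, not the layout defined by the RFC patches.]

/* Illustrative sketch of one fuse-over-io-uring queue entry as described
 * above: the fuse header lives in its own 4K section, the payload that
 * follows stays page aligned (e.g. for O_DIRECT passthrough), and the
 * buffer remains valid until the request is completed.  All names and
 * sizes are hypothetical - the RFC patches define their own layout. */
#include <stdint.h>

#define FUSE_RING_HEADER_SECTION 4096	/* assumed fixed header section */

struct fuse_ring_entry_buf {
	union {
		uint8_t header_section[FUSE_RING_HEADER_SECTION];
		struct {
			uint64_t unique;	/* fuse request id */
			uint32_t opcode;	/* FUSE_READ, FUSE_LOOKUP, ... */
			uint32_t payload_len;	/* valid bytes in payload[] */
		} hdr;
	};
	uint8_t payload[];	/* page-aligned request data follows */
};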
On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > Secondly, with IORING_OP_URING_CMD we already have only a single command > to submit requests and fetch the next one - half of the system calls. > > Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach? > https://github.com/uroni/fuseuring? > I.e. it hook into the existing fuse and just changes from read()/write() > of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero > control which ring/queue and which ring-entry handles the request. Unlike system calls, io_uring ops should have very little overhead. That's one of the main selling points of io_uring (as described in the io_uring(7) man page). So I don't think it matters to performance whether there's a combined WRITEV + READV (or COMMIT + FETCH) op or separate ops. The advantage of separate ops is more flexibility and less complexity (do only one thing in an op). > Thirdly, initially I had even removed the allocation of 'struct > fuse_req' and directly allocated these on available ring entries. I.e. > the application thread was writing to the mmap ring buffer. I just > removed that code for now, as it introduced additional complexity with > an unknown performance outcome. If it should be helpful we could add > that later. I don't think we have that flexibility with > IORING_OP_READV/IORING_OP_WRITEV. I think I understand what you'd like to see in the end: basically a reverse io_uring, where requests are placed on a "submission queue" by the kernel and completed requests are placed on a completion queue by the userspace. Exactly the opposite of current io_uring. The major difference between your idea of a fuse_uring and the io_uring seems to be that you place not only the request on the shared buffer, but the data as well. I don't think this is a good idea, since it will often incur one more memory copy. Otherwise the idea itself seems sound. The implementation twisted due to having to integrate it with io_uring. Unfortunately placing fuse requests directly into the io_uring queue doesn't work, due to the reversal of roles and the size difference between sqe and cqe entries. Also the shared buffer seems to lose its ring aspect due to the fact that fuse doesn't get notified when a request is taken off the queue (io_uring_cqe_seen(3)). So I think either better integration with io_uring is needed with support for "reverse submission" or a new interface. > > > > Maybe there's an advantage in using an atomic op for WRITEV + READV, > > but I'm not quite seeing it yet, since there's no syscall overhead for > > separate ops. > > Here I get confused, could please explain? > Current fuse has a read + write operation - a read() system call to > process a fuse request and a write() call to submit the result and then > read() to fetch the next request. > If userspace has to send IORING_OP_READV to fetch a request and complete > with IORING_OP_IORING_OP_WRITEV it would go through existing code path > with operations? Well, maybe userspace could submit with IOSQE_IO_LINK, > but that sounds like it would need to send two ring entries? I.e. memory > and processing overhead? Overhead should be minimal. > And then, no way to further optimize and do fuse_req allocation on the > ring (if there are free entries). And probably also no way that we ever > let the application work in the SQPOLL way, because the application > thread does not have the right to read from the fuse-server buffer? I.e. 
> what I mean is that IORING_OP_URING_CMD gives a better flexibility. There should be no difference between IORING_OP_URING_CMD and IORING_OP_WRITEV + IORING_OP_READV in this respect. At least I don't see why polling would work differently: the writev should complete immediately and then the readv is queued. Same as what effectively happens with IORING_OP_URING_CMD, no? > Btw, another issue that is avoided with the new ring-request layout is > compatibility and alignment. The fuse header is always in a 4K section > of the request data follow then. I.e. extending request sizes does not > impose compatibility issues any more and also for passthrough and > similar - O_DIRECT can be passed through to the backend file system. > Although these issues probably need to be solved into the current fuse > protocol. Yes. > Last but not least, with separation there is no global async queue > anymore - no global lock and cache issues. The global async code should be moved into the /dev/fuse specific "legacy" queuing so it doesn't affect either uring or virtiofs queuing. > >> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/ > >> > >> This cache line bouncing should be addressed by these patches > >> as well. > > > > Why do you think this is solved? > > > I _guess_ that writing to the mmaped buffer and processing requests on > the same cpu core should make it possible to keep things in cpu cache. I > did not verify that with perf, though. Well, the issue is with any context switch that happens in the multithreaded fuse server process. Shared address spaces will have a common "which CPU this is currently running on" bitmap (mm->cpu_bitmap), which is updated whenever one of the threads using this address space gets scheduled or descheduled. Now imagine a fuse server running on a big numa system, which has threads bound to each CPU. The server tries to avoid using sharing data structures between threads, so that cache remains local. But it can't avoid updating this bitmap on schedule. The bitmap can pack 512 CPUs into a single cacheline, which means that thread locality is compromised. I'm somewhat surprised that this doesn't turn up in profiles in real life, but I guess it's not a big deal in most cases. I only observed it with a special "no-op" fuse server running on big numa and with per-thread queuing, etc. enabled (fuse2). > For sync requests getting the scheduler involved is what is responsible > for making really fuse slow. It schedules on random cores, that are in > sleep states and additionally frequency scaling does not go up. We > really need to stay on the same core here, as that is submitting the > result, the core is already running (i.e. not sleeping) and has data in > its cache. All benchmark results with sync requests point that out. No arguments about that. > For async requests, the above still applies, but basically one thread is > writing/reading and the other thread handles/provides the data. Random > switching of cores is then still not good. At best and to be tested, > submitting rather large chunks to other cores. > What is indeed to be discussed (and think annotated in the corresponding > patch description), if there is a way to figure out if the other core is > already busy. But then the scheduler does not know what it is busy with > - are these existing fuse requests or something else. That part is > really hard and I don't think it makes sense to discuss this right now > before the main part is merged. 
IMHO, better to add a config flag for > the current cpu+1 scheduling with an annotation that this setting might > go away in the future. The cpu + 1 seems pretty arbitrary, and could be a very bad decision if there are independent tasks bound to certain CPUs or if the target turns out to be on a very distant CPU. I'm not sure what the right answer is. It's probably something like: try to schedule this on a CPU which is not busy but is not very distant from this one. The scheduler can probably answer this question, but there's no current API for this. Thanks, Miklos
On 6/11/24 17:35, Miklos Szeredi wrote: > On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > >> Secondly, with IORING_OP_URING_CMD we already have only a single command >> to submit requests and fetch the next one - half of the system calls. >> >> Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach? >> https://github.com/uroni/fuseuring? >> I.e. it hook into the existing fuse and just changes from read()/write() >> of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero >> control which ring/queue and which ring-entry handles the request. > > Unlike system calls, io_uring ops should have very little overhead. > That's one of the main selling points of io_uring (as described in the > io_uring(7) man page). > > So I don't think it matters to performance whether there's a combined > WRITEV + READV (or COMMIT + FETCH) op or separate ops. This has to be performance proven and is no means what I'm seeing. How should io-uring improve performance if you have the same number of system calls? As I see it (@Jens or @Pavel or anyone else please correct me if I'm wrong), advantage of io-uring comes when there is no syscall overhead at all - either you have a ring with multiple entries and then one side operates on multiple entries or you have polling and no syscall overhead either. We cannot afford cpu intensive polling - out of question, besides that I had even tried SQPOLL and it made things worse (that is actually where my idea about application polling comes from). As I see it, for sync blocking calls (like meta operations) with one entry in the queue, you would get no advantage with IORING_OP_READV/IORING_OP_WRITEV - io-uring has do two system calls - one to submit from kernel to userspace and another from userspace to kernel. Why should io-uring be faster there? And from my testing this is exactly what I had seen - io-uring for meta requests (i.e. without a large request queue and *without* core affinity) makes meta operations even slower that /dev/fuse. For anything that imposes a large ring queue and where either side (kernel or userspace) needs to process multiple ring entries - system call overhead gets reduced by the queue size. Just for DIO or meta operations that is hard to reach. Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would change in fuse kernel? I.e. IOs would go via fuse_dev_read()? I.e. we would not have encoded in the request which queue it belongs to? > > The advantage of separate ops is more flexibility and less complexity > (do only one thing in an op) Did you look at patch 12/19? It just does fuse_uring_req_end_and_get_next(). That part isn't complex, imho. > >> Thirdly, initially I had even removed the allocation of 'struct >> fuse_req' and directly allocated these on available ring entries. I.e. >> the application thread was writing to the mmap ring buffer. I just >> removed that code for now, as it introduced additional complexity with >> an unknown performance outcome. If it should be helpful we could add >> that later. I don't think we have that flexibility with >> IORING_OP_READV/IORING_OP_WRITEV. > > I think I understand what you'd like to see in the end: basically a > reverse io_uring, where requests are placed on a "submission queue" by > the kernel and completed requests are placed on a completion queue by > the userspace. Exactly the opposite of current io_uring. 
> > The major difference between your idea of a fuse_uring and the > io_uring seems to be that you place not only the request on the shared > buffer, but the data as well. I don't think this is a good idea, > since it will often incur one more memory copy. Otherwise the idea > itself seems sound. Could you explain what you mean by "one more memory copy"? As it is right now, 'struct fuse_req' is always allocated as it was before and then a copy is done to the ring entry. No difference to legacy /dev/fuse IO, which also copies to the read buffer. If we avoided allocating 'struct fuse_req' when there are free ring entries, we would reduce copies, but never increase them. Btw, an advantage for the ring is on the libfuse side, where the fuse-request buffer is assigned to the CQE and as long as the request is not completed, the buffer is valid. (For /dev/fuse IO that could be solved in libfuse by decoupling request memory from the thread, but with the current ring buffer design that just happens more naturally and memory is limited by the queue size.) > > The implementation twisted due to having to integrate it with > io_uring. Unfortunately placing fuse requests directly into the > io_uring queue doesn't work, due to the reversal of roles and the size > difference between sqe and cqe entries. Also the shared buffer seems > to lose its ring aspect due to the fact that fuse doesn't get notified > when a request is taken off the queue (io_uring_cqe_seen(3)). > > So I think either better integration with io_uring is needed with > support for "reverse submission" or a new interface. Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And ublk_drv also works exactly that way. I had pointed it out before, initially I had considered writing a reverse io-uring myself and then exactly at that time ublk came up. The interface of that 'reverse io' to io-uring is really simple. 1) Userspace sends an IORING_OP_URING_CMD SQE 2) That CMD gets handled/queued by struct file_operations::uring_cmd / fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the request 3) When the fuse client has data to complete the request, it calls io_uring_cmd_done() and the fuse server receives a CQE with the fuse request. Personally I don't see anything twisted here, one just needs to understand that IORING_OP_URING_CMD was written for that reverse order. (A light twist came up when io-uring introduced issue_flags - that is part of the discussion of patch 19/19 with Jens in the series. Jens suggested working on io-uring improvements once the main series is merged. I.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask Jens for help once the other parts are merged. Right now that's easy to work around by always submitting with an io-uring task). Also, that simplicity is the reason why I'm hesitating a bit to work on Kent's new ring, as io-uring already has all that we need and with a rather simple interface. Well, maybe you mean patch 09/19 "Add a dev_release exception for fuse-over-io-uring". Yep, that is the shutdown part I'm not too happy about and which initially led to the async release thread in RFCv1. > >>> >>> Maybe there's an advantage in using an atomic op for WRITEV + READV, >>> but I'm not quite seeing it yet, since there's no syscall overhead for >>> separate ops. >> >> Here I get confused, could please explain?
>> Current fuse has a read + write operation - a read() system call to >> process a fuse request and a write() call to submit the result and then >> read() to fetch the next request. >> If userspace has to send IORING_OP_READV to fetch a request and complete >> with IORING_OP_IORING_OP_WRITEV it would go through existing code path >> with operations? Well, maybe userspace could submit with IOSQE_IO_LINK, >> but that sounds like it would need to send two ring entries? I.e. memory >> and processing overhead? > > Overhead should be minimal. See above, for single entry blocking requests you get two system calls + io-uring overhead. > >> And then, no way to further optimize and do fuse_req allocation on the >> ring (if there are free entries). And probably also no way that we ever >> let the application work in the SQPOLL way, because the application >> thread does not have the right to read from the fuse-server buffer? I.e. >> what I mean is that IORING_OP_URING_CMD gives a better flexibility. > > There should be no difference between IORING_OP_URING_CMD and > IORING_OP_WRITEV + IORING_OP_READV in this respect. At least I don't > see why polling would work differently: the writev should complete > immediately and then the readv is queued. Same as what effectively > happens with IORING_OP_URING_CMD, no? Polling yes, but without shared memory the application thread does not have the right to read from the fuse userspace server's request buffer? > >> Btw, another issue that is avoided with the new ring-request layout is >> compatibility and alignment. The fuse header is always in a 4K section >> of the request data follow then. I.e. extending request sizes does not >> impose compatibility issues any more and also for passthrough and >> similar - O_DIRECT can be passed through to the backend file system. >> Although these issues probably need to be solved into the current fuse >> protocol. > > Yes. > >> Last but not least, with separation there is no global async queue >> anymore - no global lock and cache issues. > > The global async code should be moved into the /dev/fuse specific > "legacy" queuing so it doesn't affect either uring or virtiofs > queuing. Yep, wait a few days - I have seen your recent patch and I may add that to my series. I actually considered pointing out that the bg queue could be handled by that series as well, but then preferred to just add a patch for that in my series, which will make use of it for the ring queue. > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/ >>>> >>>> This cache line bouncing should be addressed by these patches >>>> as well. >>> >>> Why do you think this is solved? >> >> >> I _guess_ that writing to the mmaped buffer and processing requests on >> the same cpu core should make it possible to keep things in cpu cache. I >> did not verify that with perf, though. > > Well, the issue is with any context switch that happens in the > multithreaded fuse server process. Shared address spaces will have a > common "which CPU this is currently running on" bitmap > (mm->cpu_bitmap), which is updated whenever one of the threads using > this address space gets scheduled or descheduled. > > Now imagine a fuse server running on a big numa system, which has > threads bound to each CPU. The server tries to avoid using sharing > data structures between threads, so that cache remains local. But it > can't avoid updating this bitmap on schedule.
The bitmap can pack 512 > CPUs into a single cacheline, which means that thread locality is > compromised. To be honest, I wonder how you worked around scheduler issues on waking up the application thread. Did you core bind application threads as well (I mean besides fuse server threads)? We now have this (unexported) wake_on_current_cpu. Last year that still wasn't working perfectly well and Hillf Danton has suggested the 'seesaw' approach. And with that the scheduler was working very well. You could get the same with application core binding, but with 512 CPUs that is certainly not done manually anymore. Did you use a script to bind application threads or did you core bind from within the application? > > I'm somewhat surprised that this doesn't turn up in profiles in real > life, but I guess it's not a big deal in most cases. I only observed > it with a special "no-op" fuse server running on big numa and with > per-thread queuing, etc. enabled (fuse2). Ok, I'm testing only with 32 cores and two numa nodes. For final benchmarking I could try to get a more recent AMD based system with 96 cores. I don't think we have anything near 512 CPUs in the lab. I'm not aware of such customer systems either. > >> For sync requests getting the scheduler involved is what is responsible >> for making really fuse slow. It schedules on random cores, that are in >> sleep states and additionally frequency scaling does not go up. We >> really need to stay on the same core here, as that is submitting the >> result, the core is already running (i.e. not sleeping) and has data in >> its cache. All benchmark results with sync requests point that out. > > No arguments about that. > >> For async requests, the above still applies, but basically one thread is >> writing/reading and the other thread handles/provides the data. Random >> switching of cores is then still not good. At best and to be tested, >> submitting rather large chunks to other cores. >> What is indeed to be discussed (and think annotated in the corresponding >> patch description), if there is a way to figure out if the other core is >> already busy. But then the scheduler does not know what it is busy with >> - are these existing fuse requests or something else. That part is >> really hard and I don't think it makes sense to discuss this right now >> before the main part is merged. IMHO, better to add a config flag for >> the current cpu+1 scheduling with an annotation that this setting might >> go away in the future. > > The cpu + 1 seems pretty arbitrary, and could be a very bad decision > if there are independent tasks bound to certain CPUs or if the target > turns out to be on a very distant CPU. > > I'm not sure what the right answer is. It's probably something like: > try to schedule this on a CPU which is not busy but is not very > distant from this one. The scheduler can probably answer this > question, but there's no current API for this. This is just another optimization. What you write is true and I was aware of that. It was just a rather simple optimization that improved results - enough to demo it. We can work with scheduler people in the future on that or we add a bit of our own logic and create mapping of cpu -> next-cpu-on-same-numa-node. Certainly an API and help from the scheduler would be preferred. Thanks, Bernd
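[As a concrete illustration of the three-step "reverse" flow Bernd describes in the mail above, here is a minimal userspace-side sketch using liburing. The kernel-side counterpart (fuse_uring_cmd() returning -EIOCBQUEUED and later calling io_uring_cmd_done()) is what the RFC implements; the command opcode value and the cmd payload contents below are made-up placeholders for illustration, not the RFC's ABI.]

/* Userspace sketch: queue one IORING_OP_URING_CMD per ring entry, the
 * kernel parks it and completes it via io_uring_cmd_done() once a fuse
 * request is available, so the CQE announces the next request in the
 * shared buffer.  Error handling is mostly omitted. */
#include <liburing.h>
#include <stdio.h>
#include <string.h>

#define FUSE_URING_REQ_FETCH	1	/* placeholder, not the real value */

static int queue_fetch(struct io_uring *ring, int fuse_dev_fd, int qid)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	/* No buffer in the SQE itself; the mmap'ed ring entry is identified
	 * through the inline command payload (layout is an assumption). */
	io_uring_prep_rw(IORING_OP_URING_CMD, sqe, fuse_dev_fd, NULL, 0, 0);
	sqe->cmd_op = FUSE_URING_REQ_FETCH;
	memcpy(sqe->cmd, &qid, sizeof(qid)); /* larger payloads need IORING_SETUP_SQE128 */
	io_uring_sqe_set_data(sqe, (void *)(long)qid);
	return 0;
}

int main(void)
{
	struct io_uring_params p = { .flags = IORING_SETUP_SQE128 };
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	int fuse_dev_fd = -1;	/* would come from the fuse mount / device clone */

	if (io_uring_queue_init_params(8, &ring, &p) < 0)
		return 1;
	queue_fetch(&ring, fuse_dev_fd, 0);
	io_uring_submit(&ring);			/* step 1: send the SQE */
	io_uring_wait_cqe(&ring, &cqe);		/* step 3: CQE carries the fuse request */
	printf("fetch completed, res=%d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}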
On Tue, Jun 11, 2024 at 07:37:30PM GMT, Bernd Schubert wrote: > > > On 6/11/24 17:35, Miklos Szeredi wrote: > > On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > > >> Secondly, with IORING_OP_URING_CMD we already have only a single command > >> to submit requests and fetch the next one - half of the system calls. > >> > >> Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach? > >> https://github.com/uroni/fuseuring? > >> I.e. it hook into the existing fuse and just changes from read()/write() > >> of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero > >> control which ring/queue and which ring-entry handles the request. > > > > Unlike system calls, io_uring ops should have very little overhead. > > That's one of the main selling points of io_uring (as described in the > > io_uring(7) man page). > > > > So I don't think it matters to performance whether there's a combined > > WRITEV + READV (or COMMIT + FETCH) op or separate ops. > > This has to be performance proven and is no means what I'm seeing. How > should io-uring improve performance if you have the same number of > system calls? > > As I see it (@Jens or @Pavel or anyone else please correct me if I'm > wrong), advantage of io-uring comes when there is no syscall overhead at > all - either you have a ring with multiple entries and then one side > operates on multiple entries or you have polling and no syscall overhead > either. We cannot afford cpu intensive polling - out of question, > besides that I had even tried SQPOLL and it made things worse (that is > actually where my idea about application polling comes from). > As I see it, for sync blocking calls (like meta operations) with one > entry in the queue, you would get no advantage with > IORING_OP_READV/IORING_OP_WRITEV - io-uring has do two system calls - > one to submit from kernel to userspace and another from userspace to > kernel. Why should io-uring be faster there? > > And from my testing this is exactly what I had seen - io-uring for meta > requests (i.e. without a large request queue and *without* core > affinity) makes meta operations even slower that /dev/fuse. > > For anything that imposes a large ring queue and where either side > (kernel or userspace) needs to process multiple ring entries - system > call overhead gets reduced by the queue size. Just for DIO or meta > operations that is hard to reach. > > Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would > change in fuse kernel? I.e. IOs would go via fuse_dev_read()? > I.e. we would not have encoded in the request which queue it belongs to? Want to try out my new ringbuffer syscall? I haven't yet dug far into the fuse protocol or /dev/fuse code yet, only skimmed. But using it to replace the read/write syscall overhead should be straightforward; you'll want to spin up a kthread for responding to requests. The next thing I was going to look at is how you guys are using splice, we want to get away from that too. Brian was also saying the fuse virtio_fs code may be worth investigating, maybe that could be adapted?
On Tue, 11 Jun 2024 at 19:37, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > So I don't think it matters to performance whether there's a combined > > WRITEV + READV (or COMMIT + FETCH) op or separate ops. > > This has to be performance proven and is no means what I'm seeing. How > should io-uring improve performance if you have the same number of > system calls? The ops can be queued together and submitted together. Two separate (but possibly linked) ops should result in exactly the same number of syscalls as a single combined op. > Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would > change in fuse kernel? I.e. IOs would go via fuse_dev_read()? > I.e. we would not have encoded in the request which queue it belongs to? The original idea was to use the cloned /dev/fuse fd to sort requests into separate queues. That was only half finished: the input queue is currently shared by all clones, but once a request is read by the server from a particular clone it is put into a separate processing queue. Adding separate input queues to each clone should also be possible. I'm not saying this is definitely the direction we should be taking, but it's something to consider. > > The advantage of separate ops is more flexibility and less complexity > > (do only one thing in an op) > > Did you look at patch 12/19? It just does > fuse_uring_req_end_and_get_next(). That part isn't complex, imho. That function name indicates that this is too complex: it is doing two independent things (ending one request and fetching the next). Fine if it's a valid optimization, but I'm saying that it likely isn't. > > The major difference between your idea of a fuse_uring and the > > io_uring seems to be that you place not only the request on the shared > > buffer, but the data as well. I don't think this is a good idea, > > since it will often incur one more memory copy. Otherwise the idea > > itself seems sound. > > Coud you explain what you mean with "one more memory copy"? If the filesystem is providing the result of a READ request as a pointer to a buffer (which can be the case with fuse_reply_data()), then that buffer will need to be copied to the shared buffer, and from the shared buffer to the read destination. That first copy is unnecessary if the kernel receives the pointer to the userspace buffer and copies the data directly to the destination. > > So I think either better integration with io_uring is needed with > > support for "reverse submission" or a new interface. > > Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And > ublk_drv also works exactly that way. I had pointed it out before, > initially I had considered to write a reverse io-uring myself and then > exactly at that time ublk came up. I'm just looking for answers why this architecture is the best. Maybe it is, but I find it too complex and can't explain why it's going to perform better than a much simpler single ring model. > The interface of that 'reverse io' to io-uring is really simple. > > 1) Userspace sends a IORING_OP_URING_CMD SQE > 2) That CMD gets handled/queued by struct file_operations::uring_cmd / > fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the > request > 3) When fuse client has data to complete the request, it calls > io_uring_cmd_done() and fuse server receives a CQE with the fuse request. > > Personally I don't see anything twisted here, one just needs to > understand that IORING_OP_URING_CMD was written for that reverse order. That's just my gut feeling. 
fuse/dev_uring.c is 1233 in this RFC. And that's just the queuing. > (There came up a light twisting when io-uring introduced issue_flags - > that is part of discussion of patch 19/19 with Jens in the series. Jens > suggested to work on io-uring improvements once the main series is > merged. I.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask > Jens for help once the other parts are merged. Right now that easy to > work around by always submitting with an io-uring task). > > Also, that simplicity is the reason why I'm hesitating a bit to work on > Kents new ring, as io-uring already has all what we need and with a > rather simple interface. I'm in favor of using io_uring, if possible. I'm also in favor of a single shared buffer (ring) if possible. Using cloned fd + plain READV / WRITEV ops is one possibility. But I'm not opposed to IORING_OP_URING_CMD either. Btw, fuse reply could be inlined in the majority of cases into that 80 byte free space in the sqe. Also might consider an extended cqe mode, where short fuse request could be inlined as well (e.g. IORING_SETUP_CQE128 -> 112 byte payload). > To be honest, I wonder how you worked around scheduler issues on waking > up the application thread. Did you core bind application threads as well > (I mean besides fuse server threads)? We now have this (unexported) > wake_on_current_cpu. Last year that still wasn't working perfectly well > and Hillf Danton has suggested the 'seesaw' approach. And with that the > scheduler was working very well. You could get the same with application > core binding, but with 512 CPUs that is certainly not done manually > anymore. Did you use a script to bind application threads or did you > core bind from within the application? Probably, I don't remember anymore. Thanks, Miklos
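[To make the "queued together and submitted together" point concrete, here is a hedged liburing sketch of the alternative Miklos describes above: a reply (writev) hard-linked to the fetch of the next request (readv) on a /dev/fuse style fd, both handed to the kernel with a single io_uring_enter(). Buffer management and fd setup are simplified assumptions, not existing fuse code.]

/* Sketch of the COMMIT + FETCH alternative using plain ops: link the
 * writev of the previous reply to the readv of the next request and
 * submit both SQEs with one syscall.  Error handling omitted. */
#include <liburing.h>
#include <sys/uio.h>

static int commit_and_fetch(struct io_uring *ring, int fuse_fd,
			    struct iovec *reply, struct iovec *next_buf)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);		/* reply to the previous request */
	io_uring_prep_writev(sqe, fuse_fd, reply, 1, 0);
	sqe->flags |= IOSQE_IO_LINK;		/* only then ... */

	sqe = io_uring_get_sqe(ring);		/* ... fetch the next one */
	io_uring_prep_readv(sqe, fuse_fd, next_buf, 1, 0);

	/* Both SQEs reach the kernel in a single io_uring_enter(). */
	return io_uring_submit(ring);
}

[Whether this keeps enough control over which queue and ring entry serves a request - the point raised in the reply below - is what distinguishes it from the IORING_OP_URING_CMD design.]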
On 6/12/24 09:39, Miklos Szeredi wrote: > On Tue, 11 Jun 2024 at 19:37, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > >>> So I don't think it matters to performance whether there's a combined >>> WRITEV + READV (or COMMIT + FETCH) op or separate ops. >> >> This has to be performance proven and is no means what I'm seeing. How >> should io-uring improve performance if you have the same number of >> system calls? > > The ops can be queued together and submitted together. Two separate > (but possibly linked) ops should result in exactly the same number of > syscalls as a single combined op. As I wrote before, that requires the double amount of queue buffer memory. Goes totally into the opposite direction of what I'm currently working on - to use less memory, as not all requests need 1MB buffers and as we want to use as little as possible memory. > >> Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would >> change in fuse kernel? I.e. IOs would go via fuse_dev_read()? >> I.e. we would not have encoded in the request which queue it belongs to? > > The original idea was to use the cloned /dev/fuse fd to sort requests > into separate queues. That was only half finished: the input queue is > currently shared by all clones, but once a request is read by the > server from a particular clone it is put into a separate processing > queue. Adding separate input queues to each clone should also be > possible. > > I'm not saying this is definitely the direction we should be taking, > but it's something to consider. I was considering to use that for the mmap method, but then found it easier to just add per numa to an rb-tree. Well, maybe I should reconsider, as the current patch series clones the device anyway. > >>> The advantage of separate ops is more flexibility and less complexity >>> (do only one thing in an op) >> >> Did you look at patch 12/19? It just does >> fuse_uring_req_end_and_get_next(). That part isn't complex, imho. > > That function name indicates that this is too complex: it is doing two > independent things (ending one request and fetching the next). > > Fine if it's a valid optimization, but I'm saying that it likely isn't. Would it help, if it would be in two lines? Like fuse_uring_req_end(); fuse_uring_req_next(); It has to check if there are requests queued, as it goes on hold if not and then won't process the queue. > >>> The major difference between your idea of a fuse_uring and the >>> io_uring seems to be that you place not only the request on the shared >>> buffer, but the data as well. I don't think this is a good idea, >>> since it will often incur one more memory copy. Otherwise the idea >>> itself seems sound. >> >> Coud you explain what you mean with "one more memory copy"? > > If the filesystem is providing the result of a READ request as a > pointer to a buffer (which can be the case with fuse_reply_data()), > then that buffer will need to be copied to the shared buffer, and from > the shared buffer to the read destination. > > That first copy is unnecessary if the kernel receives the pointer to > the userspace buffer and copies the data directly to the destination. I didn't do that yet, as we are going to use the ring buffer for requests, i.e. the ring buffer immediately gets all the data from network, there is no copy. Even if the ring buffer would get data from local disk - there is no need to use a separate application buffer anymore. 
And with that there is just no extra copy. Keep in mind that the ring buffers are coupled with the request and not the processing thread as in current libfuse - the buffer is valid as long as the request is not completed. That part is probably harder with IORING_OP_READV/IORING_OP_WRITEV, especially if you need two of them. Your idea sounds useful if userspace had its own cache outside of the ring buffers, and that could be added in as another optimization, something like: - fuse-ring-req gets a user pointer - flag if the user pointer is set - kernel acts on the flag - completion gets sent to userspace by: - if the next request is already in the queue: piggy-back the completion into the next request - if no next request: send a separate completion message If you want, I could add that as another optimization patch to the next RFC. Maybe that is also useful for existing libfuse applications that are used to the fact that the request buffer is associated with the thread and not the request itself. If we want real zero copy, we can add in Ming's work on that. I'm going to look into that today or tomorrow. > >>> So I think either better integration with io_uring is needed with >>> support for "reverse submission" or a new interface. >> >> Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And >> ublk_drv also works exactly that way. I had pointed it out before, >> initially I had considered to write a reverse io-uring myself and then >> exactly at that time ublk came up. > > I'm just looking for answers why this architecture is the best. Maybe > it is, but I find it too complex and can't explain why it's going to > perform better than a much simpler single ring model. > >> The interface of that 'reverse io' to io-uring is really simple. >> >> 1) Userspace sends a IORING_OP_URING_CMD SQE >> 2) That CMD gets handled/queued by struct file_operations::uring_cmd / >> fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the >> request >> 3) When fuse client has data to complete the request, it calls >> io_uring_cmd_done() and fuse server receives a CQE with the fuse request. >> >> Personally I don't see anything twisted here, one just needs to >> understand that IORING_OP_URING_CMD was written for that reverse order. > > That's just my gut feeling. fuse/dev_uring.c is 1233 in this RFC. > And that's just the queuing. Dunno, from my point of view the main logic is so much simpler than what fuse_dev_do_read() has - that function checks multiple queues, takes multiple locks, has to add the fuse-req to a (now hashed) list and has a restart loop. If possible, I really wouldn't like to make that even more complex. But we want to add to it: - Selecting different ring entries based on their io-size (what I'm currently adding in, to reduce memory per queue). I don't think that would even be possible with the current wake logic. - Add in credits for different IO types to avoid that one IO type can fill the entire queue. Right now the ring only has a separation of sync/async, but I don't think that is enough. To compare, could you please check the code flow of FUSE_URING_REQ_COMMIT_AND_FETCH? I have no issue with splitting fuse_uring_req_end_and_get_next() into two functions. What I mean is that the code flow is hopefully not hard to follow: it ends the request and then puts the entry on the avail list. Then it checks if there is a pending fuse request and handles that; from my point of view that is much easier to follow and has fewer conditions than what fuse_dev_do_read() has.
And that even though there is already a separation of sync and async queues. > >> (There came up a light twisting when io-uring introduced issue_flags - >> that is part of discussion of patch 19/19 with Jens in the series. Jens >> suggested to work on io-uring improvements once the main series is >> merged. I.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask >> Jens for help once the other parts are merged. Right now that easy to >> work around by always submitting with an io-uring task). >> >> Also, that simplicity is the reason why I'm hesitating a bit to work on >> Kents new ring, as io-uring already has all what we need and with a >> rather simple interface. > > I'm in favor of using io_uring, if possible. I have no objections, but I would also like to see an RHEL version with it... > > I'm also in favor of a single shared buffer (ring) if possible. Using > cloned fd + plain READV / WRITEV ops is one possibility. Cons: - READV / WRITEV would need to be coupled in order to avoid two io-uring cmd-enter system calls - double the amount of memory - not impossible, but harder to achieve that the request buffer belongs to the request itself - request data section and fuse header are not clearly separated, data alignment compat issues - different IO sizes are hard to impossible to support - with large queue sizes, high memory usage, even if the majority only needs small requests - request type credits are much harder to achieve - the vfs application cannot directly write into the ring buffer, which reduces future optimizations - the new zero copy approach Ming Lei is working on cannot be used - not very flexible for future additions, IMHO Pros: - probably fewer code additions - no shutdown issues - existing splice works > > But I'm not opposed to IORING_OP_URING_CMD either. Btw, fuse reply > could be inlined in the majority of cases into that 80 byte free space > in the sqe. Also might consider an extended cqe mode, where short > fuse request could be inlined as well (e.g. IORING_SETUP_CQE128 -> 112 > byte payload). That conflicts with wanting the fuse header in a separate section to avoid alignment and compat issues. In fact, we might even want to make that header section depend on the system page size. And then what would be the advantage - we have the buffer anyway? Thanks, Bernd PS: What I definitely realize is that at LSFMM2023 I should have talked about why I had taken that approach, and should have cut down the slides about architecture and performance.
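To make the 'reverse order' flow quoted above more concrete, here is a minimal, hypothetical sketch of the kernel side. Only IORING_OP_URING_CMD, the -EIOCBQUEUED return and io_uring_cmd_done() are the real io_uring pieces (exact io_uring_cmd_done() arguments differ between kernel versions); struct fuse_ring_ent and the fuse_uring_* helpers are made-up names for illustration, not the code from the patch series.

/* hypothetical sketch, not the actual patch series code */
struct fuse_ring_ent {
    struct io_uring_cmd *cmd;   /* parked SQE waiting for a fuse request */
    void *buf;                  /* request/reply buffer of this entry */
};

/* 1) + 2): the fuse server submits IORING_OP_URING_CMD; the handler parks
 * the ring entry and tells io_uring that the CQE will be posted later.
 */
static int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
    struct fuse_ring_ent *ent = fuse_uring_lookup_entry(cmd);  /* assumed helper */

    ent->cmd = cmd;
    fuse_uring_entry_available(ent);    /* assumed: entry may now take requests */
    return -EIOCBQUEUED;
}

/* 3): when a fuse request is queued to this entry, it is copied into the
 * entry's buffer and the command is completed; the server receives a CQE.
 */
static void fuse_uring_queue_req(struct fuse_ring_ent *ent, struct fuse_req *req)
{
    fuse_uring_copy_to_buf(ent->buf, req);  /* assumed helper */
    io_uring_cmd_done(ent->cmd, 0, 0, 0);
}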
On 6/12/24 15:32, Bernd Schubert wrote: > On 6/12/24 09:39, Miklos Szeredi wrote: >>> Personally I don't see anything twisted here, one just needs to >>> understand that IORING_OP_URING_CMD was written for that reverse order. >> >> That's just my gut feeling. fuse/dev_uring.c is 1233 in this RFC. >> And that's just the queuing. > Btw, counting lines, the majority of that is not queuing and handling of requests, but setting things up, shutdown (start/stop is already almost half of the file) and doing sanity checks, as in fuse_uring_get_verify_queue().
On 6/12/24 01:35, Kent Overstreet wrote: > On Tue, Jun 11, 2024 at 07:37:30PM GMT, Bernd Schubert wrote: >> >> >> On 6/11/24 17:35, Miklos Szeredi wrote: >>> On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: >>> >>>> Secondly, with IORING_OP_URING_CMD we already have only a single command >>>> to submit requests and fetch the next one - half of the system calls. >>>> >>>> Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach? >>>> https://github.com/uroni/fuseuring? >>>> I.e. it hook into the existing fuse and just changes from read()/write() >>>> of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero >>>> control which ring/queue and which ring-entry handles the request. >>> >>> Unlike system calls, io_uring ops should have very little overhead. >>> That's one of the main selling points of io_uring (as described in the >>> io_uring(7) man page). >>> >>> So I don't think it matters to performance whether there's a combined >>> WRITEV + READV (or COMMIT + FETCH) op or separate ops. >> >> This has to be performance proven and is no means what I'm seeing. How >> should io-uring improve performance if you have the same number of >> system calls? >> >> As I see it (@Jens or @Pavel or anyone else please correct me if I'm >> wrong), advantage of io-uring comes when there is no syscall overhead at >> all - either you have a ring with multiple entries and then one side >> operates on multiple entries or you have polling and no syscall overhead >> either. We cannot afford cpu intensive polling - out of question, >> besides that I had even tried SQPOLL and it made things worse (that is >> actually where my idea about application polling comes from). >> As I see it, for sync blocking calls (like meta operations) with one >> entry in the queue, you would get no advantage with >> IORING_OP_READV/IORING_OP_WRITEV - io-uring has do two system calls - >> one to submit from kernel to userspace and another from userspace to >> kernel. Why should io-uring be faster there? >> >> And from my testing this is exactly what I had seen - io-uring for meta >> requests (i.e. without a large request queue and *without* core >> affinity) makes meta operations even slower that /dev/fuse. >> >> For anything that imposes a large ring queue and where either side >> (kernel or userspace) needs to process multiple ring entries - system >> call overhead gets reduced by the queue size. Just for DIO or meta >> operations that is hard to reach. >> >> Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would >> change in fuse kernel? I.e. IOs would go via fuse_dev_read()? >> I.e. we would not have encoded in the request which queue it belongs to? > > Want to try out my new ringbuffer syscall? > > I haven't yet dug far into the fuse protocol or /dev/fuse code yet, only > skimmed. But using it to replace the read/write syscall overhead should > be straightforward; you'll want to spin up a kthread for responding to > requests. I will definitely look at it this week. Although I don't like the idea to have a new kthread. We already have an application thread and have the fuse server thread, why do we need another one? > > The next thing I was going to look at is how you guys are using splice, > we want to get away from that too. Well, Ming Lei is working on that for ublk_drv and I guess that new approach could be adapted as well onto the current way of io-uring. It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. 
https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ > > Brian was also saying the fuse virtio_fs code may be worth > investigating, maybe that could be adapted? I need to check, but really, the majority of the new additions is just setting things up, shutdown and sanity checks. Request sending/completion to/from the ring does not add that many new lines. Thanks, Bernd
On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@ddn.com> wrote: > I didn't do that yet, as we are going to use the ring buffer for requests, > i.e. the ring buffer immediately gets all the data from network, there is > no copy. Even if the ring buffer would get data from local disk - there > is no need to use a separate application buffer anymore. And with that > there is just no extra copy Let's just tackle this shared request buffer, as it seems to be a central part of your design. You say the shared buffer is used to immediately get the data from the network (or various other sources), which is completely viable. And then the kernel will do the copy from the shared buffer. Single copy, fine. But if the buffer wasn't shared? What would be the difference? Single copy also. Why is the shared buffer better? I mean it may even be worse due to cache aliasing issues on certain architectures. copy_to_user() / copy_from_user() are pretty darn efficient. Why is it better to have that buffer managed by kernel? Being locked in memory (being unswappable) is probably a disadvantage as well. And if locking is required, it can be done on the user buffer. And there are all the setup and teardown complexities... Note: the ring buffer used by io_uring is different. It literally allows communication without invoking any system calls in certain cases. That shared buffer doesn't add anything like that. At least I don't see what it actually adds. Hmm? Thanks, Miklos
On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote: > I will definitely look at it this week. Although I don't like the idea > to have a new kthread. We already have an application thread and have > the fuse server thread, why do we need another one? Ok, I hadn't found the fuse server thread - that should be fine. > > > > The next thing I was going to look at is how you guys are using splice, > > we want to get away from that too. > > Well, Ming Lei is working on that for ublk_drv and I guess that new approach > could be adapted as well onto the current way of io-uring. > It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. > > https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ > > > > > Brian was also saying the fuse virtio_fs code may be worth > > investigating, maybe that could be adapted? > > I need to check, but really, the majority of the new additions > is just to set up things, shutdown and to have sanity checks. > Request sending/completing to/from the ring is not that much new lines. What I'm wondering is how read/write requests are handled. Are the data payloads going in the same ringbuffer as the commands? That could work, if the ringbuffer is appropriately sized, but alignment is an issue. We just looked up the device DMA requirements and with modern NVME only 4 byte alignment is required, but the block layer likely isn't set up to handle that. So - prearranged buffer? Or are you using splice to get pages that userspace has read into the kernel pagecache?
On 6/12/24 16:07, Miklos Szeredi wrote: > On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@ddn.com> wrote: > >> I didn't do that yet, as we are going to use the ring buffer for requests, >> i.e. the ring buffer immediately gets all the data from network, there is >> no copy. Even if the ring buffer would get data from local disk - there >> is no need to use a separate application buffer anymore. And with that >> there is just no extra copy > > Let's just tackle this shared request buffer, as it seems to be a > central part of your design. > > You say the shared buffer is used to immediately get the data from the > network (or various other sources), which is completely viable. > > And then the kernel will do the copy from the shared buffer. Single copy, fine. > > But if the buffer wasn't shared? What would be the difference? > Single copy also. > > Why is the shared buffer better? I mean it may even be worse due to > cache aliasing issues on certain architectures. copy_to_user() / > copy_from_user() are pretty darn efficient. Right now we have: - the application thread writes into the buffer, then calls io_uring_cmd_done. I can try to do without mmap and set a pointer to the user buffer in the 80B section of the SQE. I'm not sure if the application is allowed to write into that buffer; possibly/probably we will be forced to use io_uring_cmd_complete_in_task() in all cases (without 19/19 we have that anyway). My greatest fear here is that the extra task has performance implications for sync requests. > > Why is it better to have that buffer managed by kernel? Being locked > in memory (being unswappable) is probably a disadvantage as well. And > if locking is required, it can be done on the user buffer. Well, let me try passing the buffer in the 80B section. > > And there are all the setup and teardown complexities... If the buffer in the 80B section works, setup becomes easier: mmap and ioctls go away. Teardown, well, we still need the workaround as we need to handle io_uring_cmd_done, but if you could live with that for now, I would ask Jens or Pavel or Ming for help to see if we could solve that in io-uring itself. Is the ring workaround in fuse_dev_release() acceptable for you? Or do you have another idea for it? > > Note: the ring buffer used by io_uring is different. It literally > allows communication without invoking any system calls in certain > cases. That shared buffer doesn't add anything like that. At least I > don't see what it actually adds. > > Hmm? The application can write into the buffer. We wouldn't need shared queue buffers if we could solve the same with a user pointer. Thanks, Bernd
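As a reference for the 'pointer in the 80B section' idea, a rough userspace sketch of how the fuse server could pass its own buffer in the SQE128 command area; struct fuse_uring_cmd_req and the cmd_op value are assumptions, only IORING_OP_URING_CMD and the SQE128 command area are given by io_uring/liburing.

#include <liburing.h>
#include <stdint.h>

/* assumed layout of the 80-byte command area, not a final ABI */
struct fuse_uring_cmd_req {
    __u64 buf_ptr;      /* server-owned request/reply buffer */
    __u32 buf_len;
    __u32 qid;          /* queue / ring entry id */
};

/* queue one fetch command; the ring must be created with IORING_SETUP_SQE128 */
static int fuse_uring_queue_fetch(struct io_uring *ring, int fuse_fd,
                                  void *buf, unsigned int len, unsigned int qid)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    struct fuse_uring_cmd_req *req;

    if (!sqe)
        return -EAGAIN;

    io_uring_prep_rw(IORING_OP_URING_CMD, sqe, fuse_fd, NULL, 0, 0);
    sqe->cmd_op = 1;    /* e.g. FUSE_URING_REQ_FETCH, value made up */

    req = (struct fuse_uring_cmd_req *)sqe->cmd;
    req->buf_ptr = (__u64)(uintptr_t)buf;
    req->buf_len = len;
    req->qid = qid;
    return 0;
}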
On 6/12/24 16:19, Kent Overstreet wrote: > On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote: >> I will definitely look at it this week. Although I don't like the idea >> to have a new kthread. We already have an application thread and have >> the fuse server thread, why do we need another one? > > Ok, I hadn't found the fuse server thread - that should be fine. > >>> >>> The next thing I was going to look at is how you guys are using splice, >>> we want to get away from that too. >> >> Well, Ming Lei is working on that for ublk_drv and I guess that new approach >> could be adapted as well onto the current way of io-uring. >> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. >> >> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ >> >>> >>> Brian was also saying the fuse virtio_fs code may be worth >>> investigating, maybe that could be adapted? >> >> I need to check, but really, the majority of the new additions >> is just to set up things, shutdown and to have sanity checks. >> Request sending/completing to/from the ring is not that much new lines. > > What I'm wondering is how read/write requests are handled. Are the data > payloads going in the same ringbuffer as the commands? That could work, > if the ringbuffer is appropriately sized, but alignment is a an issue. That is exactly the big discussion Miklos and I are having. Basically in my series another buffer is vmalloced, mmaped and then assigned to ring entries. Fuse meta headers and application payload go into that buffer, in both kernel/userspace directions. io-uring only allows 80B, so only a really small request would fit into it. Legacy /dev/fuse has an alignment issue as the payload follows directly after the fuse header - intrinsically fixed in the ring patches. I will now try without mmap and just provide a user buffer as a pointer in the 80B section. > > We just looked up the device DMA requirements and with modern NVME only > 4 byte alignment is required, but the block layer likely isn't set up to > handle that. I think existing fuse headers and their data have a 4 byte alignment. Maybe even 8 byte, I don't remember without looking through all request types. If you try a simple O_DIRECT read/write to libfuse/example_passthrough_hp without the ring patches it will fail because of alignment. That needs to be fixed in legacy fuse and would also avoid compat issues we had in libfuse when the kernel header was updated. > > So - prearranged buffer? Or are you using splice to get pages that > userspace has read into into the kernel pagecache? I didn't even try to use splice yet, because for the DDN (my employer) use case we cannot use zero copy, at least not without violating the rule that one cannot access the application buffer in userspace. I will definitely look into Ming's work, as it will be useful for others. Cheers, Bernd
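For comparison, the mmap-based variant described above could look roughly like the following on the kernel side; fuse_ring_queue and the lookup helper are hypothetical, while remap_vmalloc_range()/vmalloc_user() are the existing kernel interfaces usually used for this pattern.

/* sketch: one vmalloc'ed per-queue buffer, mapped into the fuse server */
static int fuse_uring_mmap(struct file *file, struct vm_area_struct *vma)
{
    struct fuse_ring_queue *queue = fuse_uring_queue_from_vma(file, vma);   /* assumed */
    size_t size = vma->vm_end - vma->vm_start;

    if (!queue || size > queue->buf_size)
        return -EINVAL;

    /* queue->buf must have been allocated with vmalloc_user() at setup time */
    return remap_vmalloc_range(vma, queue->buf, vma->vm_pgoff);
}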
On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote: > On 6/12/24 16:19, Kent Overstreet wrote: > > On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote: > >> I will definitely look at it this week. Although I don't like the idea > >> to have a new kthread. We already have an application thread and have > >> the fuse server thread, why do we need another one? > > > > Ok, I hadn't found the fuse server thread - that should be fine. > > > >>> > >>> The next thing I was going to look at is how you guys are using splice, > >>> we want to get away from that too. > >> > >> Well, Ming Lei is working on that for ublk_drv and I guess that new approach > >> could be adapted as well onto the current way of io-uring. > >> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. > >> > >> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ > >> > >>> > >>> Brian was also saying the fuse virtio_fs code may be worth > >>> investigating, maybe that could be adapted? > >> > >> I need to check, but really, the majority of the new additions > >> is just to set up things, shutdown and to have sanity checks. > >> Request sending/completing to/from the ring is not that much new lines. > > > > What I'm wondering is how read/write requests are handled. Are the data > > payloads going in the same ringbuffer as the commands? That could work, > > if the ringbuffer is appropriately sized, but alignment is a an issue. > > That is exactly the big discussion Miklos and I have. Basically in my > series another buffer is vmalloced, mmaped and then assigned to ring entries. > Fuse meta headers and application payload goes into that buffer. > In both kernel/userspace directions. io-uring only allows 80B, so only a > really small request would fit into it. Well, the generic ringbuffer would lift that restriction. > Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse > header - intrinsically fixed in the ring patches. *nod* That's the big question, put the data inline (with potential alignment hassles) or manage (and map) a separate data structure. Maybe padding could be inserted to solve alignment? A separate data structure would only really be useful if it enabled zero copy, but that should probably be a secondary enhancement. > I will now try without mmap and just provide a user buffer as pointer in the 80B > section. > > > > > > We just looked up the device DMA requirements and with modern NVME only > > 4 byte alignment is required, but the block layer likely isn't set up to > > handle that. > > I think existing fuse headers have and their data have a 4 byte alignment. > Maybe even 8 byte, I don't remember without looking through all request types. > If you try a simple O_DIRECT read/write to libfuse/example_passthrough_hp > without the ring patches it will fail because of alignment. Needs to be fixed > in legacy fuse and would also avoid compat issues we had in libfuse when the > kernel header was updated. > > > > > So - prearranged buffer? Or are you using splice to get pages that > > userspace has read into into the kernel pagecache? > > I didn't even try to use splice yet, because for the DDN (my employer) use case > we cannot use zero copy, at least not without violating the rule that one > cannot access the application buffer in userspace. DDN - lustre related? > > I will definitely look into Mings work, as it will be useful for others. > > > Cheers, > Bernd
On 6/12/24 17:55, Kent Overstreet wrote: > On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote: >> On 6/12/24 16:19, Kent Overstreet wrote: >>> On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote: >>>> I will definitely look at it this week. Although I don't like the idea >>>> to have a new kthread. We already have an application thread and have >>>> the fuse server thread, why do we need another one? >>> >>> Ok, I hadn't found the fuse server thread - that should be fine. >>> >>>>> >>>>> The next thing I was going to look at is how you guys are using splice, >>>>> we want to get away from that too. >>>> >>>> Well, Ming Lei is working on that for ublk_drv and I guess that new approach >>>> could be adapted as well onto the current way of io-uring. >>>> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. >>>> >>>> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ >>>> >>>>> >>>>> Brian was also saying the fuse virtio_fs code may be worth >>>>> investigating, maybe that could be adapted? >>>> >>>> I need to check, but really, the majority of the new additions >>>> is just to set up things, shutdown and to have sanity checks. >>>> Request sending/completing to/from the ring is not that much new lines. >>> >>> What I'm wondering is how read/write requests are handled. Are the data >>> payloads going in the same ringbuffer as the commands? That could work, >>> if the ringbuffer is appropriately sized, but alignment is a an issue. >> >> That is exactly the big discussion Miklos and I have. Basically in my >> series another buffer is vmalloced, mmaped and then assigned to ring entries. >> Fuse meta headers and application payload goes into that buffer. >> In both kernel/userspace directions. io-uring only allows 80B, so only a >> really small request would fit into it. > > Well, the generic ringbuffer would lift that restriction. Yeah, kind of. Instead allocating the buffer in fuse, it would be now allocated in that code. At least all that setup code would be moved out of fuse. I will eventually come to your patches today. Now we only need to convince Miklos that your ring is better ;) > >> Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse >> header - intrinsically fixed in the ring patches. > > *nod* > > That's the big question, put the data inline (with potential alignment > hassles) or manage (and map) a separate data structure. > > Maybe padding could be inserted to solve alignment? Right now I have this struct: struct fuse_ring_req { union { /* The first 4K are command data */ char ring_header[FUSE_RING_HEADER_BUF_SIZE]; struct { uint64_t flags; /* enum fuse_ring_buf_cmd */ uint32_t in_out_arg_len; uint32_t padding; /* kernel fills in, reads out */ union { struct fuse_in_header in; struct fuse_out_header out; }; }; }; char in_out_arg[]; }; Data go into in_out_arg, i.e. headers are padded by the union. I actually wonder if FUSE_RING_HEADER_BUF_SIZE should be page size and not a fixed 4K. (I just see the stale comment 'enum fuse_ring_buf_cmd', will remove it in the next series) > > A separate data structure would only really be useful if it enabled zero > copy, but that should probably be a secondary enhancement. > >> I will now try without mmap and just provide a user buffer as pointer in the 80B >> section. >> >> >>> >>> We just looked up the device DMA requirements and with modern NVME only >>> 4 byte alignment is required, but the block layer likely isn't set up to >>> handle that. 
>> >> I think existing fuse headers have and their data have a 4 byte alignment. >> Maybe even 8 byte, I don't remember without looking through all request types. >> If you try a simple O_DIRECT read/write to libfuse/example_passthrough_hp >> without the ring patches it will fail because of alignment. Needs to be fixed >> in legacy fuse and would also avoid compat issues we had in libfuse when the >> kernel header was updated. >> >>> >>> So - prearranged buffer? Or are you using splice to get pages that >>> userspace has read into into the kernel pagecache? >> >> I didn't even try to use splice yet, because for the DDN (my employer) use case >> we cannot use zero copy, at least not without violating the rule that one >> cannot access the application buffer in userspace. > > DDN - lustre related? I have a bit of an ancient Lustre background, also with DDN, then went to Fraunhofer for FhGFS/BeeGFS (kind of competing with Lustre). Back at DDN initially on IME (burst buffer) and now Infinia. Lustre is mostly HPC only, Infinia is kind of everything.
On Wed, Jun 12, 2024 at 06:15:57PM GMT, Bernd Schubert wrote: > > > On 6/12/24 17:55, Kent Overstreet wrote: > > On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote: > > > On 6/12/24 16:19, Kent Overstreet wrote: > > > > On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote: > > > > > I will definitely look at it this week. Although I don't like the idea > > > > > to have a new kthread. We already have an application thread and have > > > > > the fuse server thread, why do we need another one? > > > > > > > > Ok, I hadn't found the fuse server thread - that should be fine. > > > > > > > > > > > > > > > > The next thing I was going to look at is how you guys are using splice, > > > > > > we want to get away from that too. > > > > > > > > > > Well, Ming Lei is working on that for ublk_drv and I guess that new approach > > > > > could be adapted as well onto the current way of io-uring. > > > > > It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. > > > > > > > > > > https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ > > > > > > > > > > > > > > > > > Brian was also saying the fuse virtio_fs code may be worth > > > > > > investigating, maybe that could be adapted? > > > > > > > > > > I need to check, but really, the majority of the new additions > > > > > is just to set up things, shutdown and to have sanity checks. > > > > > Request sending/completing to/from the ring is not that much new lines. > > > > > > > > What I'm wondering is how read/write requests are handled. Are the data > > > > payloads going in the same ringbuffer as the commands? That could work, > > > > if the ringbuffer is appropriately sized, but alignment is a an issue. > > > > > > That is exactly the big discussion Miklos and I have. Basically in my > > > series another buffer is vmalloced, mmaped and then assigned to ring entries. > > > Fuse meta headers and application payload goes into that buffer. > > > In both kernel/userspace directions. io-uring only allows 80B, so only a > > > really small request would fit into it. > > > > Well, the generic ringbuffer would lift that restriction. > > Yeah, kind of. Instead allocating the buffer in fuse, it would be now allocated > in that code. At least all that setup code would be moved out of fuse. I will > eventually come to your patches today. > Now we only need to convince Miklos that your ring is better ;) > > > > > > Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse > > > header - intrinsically fixed in the ring patches. > > > > *nod* > > > > That's the big question, put the data inline (with potential alignment > > hassles) or manage (and map) a separate data structure. > > > > Maybe padding could be inserted to solve alignment? > > Right now I have this struct: > > struct fuse_ring_req { > union { > /* The first 4K are command data */ > char ring_header[FUSE_RING_HEADER_BUF_SIZE]; > > struct { > uint64_t flags; > > /* enum fuse_ring_buf_cmd */ > uint32_t in_out_arg_len; > uint32_t padding; > > /* kernel fills in, reads out */ > union { > struct fuse_in_header in; > struct fuse_out_header out; > }; > }; > }; > > char in_out_arg[]; > }; > > > Data go into in_out_arg, i.e. headers are padded by the union. > I actually wonder if FUSE_RING_HEADER_BUF_SIZE should be page size > and not a fixed 4K. 
I would make the commands variable sized, so that commands with no data buffers don't need padding, and then when you do have a data command you only pad out that specific command so that the data buffer starts on a page boundary.
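A rough illustration of that suggestion (hypothetical layout, assuming 4K pages): header-only commands stay compact, and only commands that carry a data payload get padded so the payload starts on a page boundary.

#include <stdint.h>

#define RING_PAGE_SIZE 4096u    /* assumed page size */

/* size of a variable-sized command: header, optional padding, then payload */
static inline uint32_t ring_cmd_size(uint32_t hdr_size, uint32_t data_size)
{
    uint32_t padded_hdr;

    if (!data_size)
        return hdr_size;    /* no payload, no padding needed */

    /* round the header up so the payload lands on a page boundary */
    padded_hdr = (hdr_size + RING_PAGE_SIZE - 1) & ~(RING_PAGE_SIZE - 1);
    return padded_hdr + data_size;
}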
On 6/12/24 18:24, Kent Overstreet wrote: > On Wed, Jun 12, 2024 at 06:15:57PM GMT, Bernd Schubert wrote: >> >> >> On 6/12/24 17:55, Kent Overstreet wrote: >>> On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote: >>>> On 6/12/24 16:19, Kent Overstreet wrote: >>>>> On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote: >>>>>> I will definitely look at it this week. Although I don't like the idea >>>>>> to have a new kthread. We already have an application thread and have >>>>>> the fuse server thread, why do we need another one? >>>>> >>>>> Ok, I hadn't found the fuse server thread - that should be fine. >>>>> >>>>>>> >>>>>>> The next thing I was going to look at is how you guys are using splice, >>>>>>> we want to get away from that too. >>>>>> >>>>>> Well, Ming Lei is working on that for ublk_drv and I guess that new approach >>>>>> could be adapted as well onto the current way of io-uring. >>>>>> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. >>>>>> >>>>>> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ >>>>>> >>>>>>> >>>>>>> Brian was also saying the fuse virtio_fs code may be worth >>>>>>> investigating, maybe that could be adapted? >>>>>> >>>>>> I need to check, but really, the majority of the new additions >>>>>> is just to set up things, shutdown and to have sanity checks. >>>>>> Request sending/completing to/from the ring is not that much new lines. >>>>> >>>>> What I'm wondering is how read/write requests are handled. Are the data >>>>> payloads going in the same ringbuffer as the commands? That could work, >>>>> if the ringbuffer is appropriately sized, but alignment is a an issue. >>>> >>>> That is exactly the big discussion Miklos and I have. Basically in my >>>> series another buffer is vmalloced, mmaped and then assigned to ring entries. >>>> Fuse meta headers and application payload goes into that buffer. >>>> In both kernel/userspace directions. io-uring only allows 80B, so only a >>>> really small request would fit into it. >>> >>> Well, the generic ringbuffer would lift that restriction. >> >> Yeah, kind of. Instead allocating the buffer in fuse, it would be now allocated >> in that code. At least all that setup code would be moved out of fuse. I will >> eventually come to your patches today. >> Now we only need to convince Miklos that your ring is better ;) >> >>> >>>> Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse >>>> header - intrinsically fixed in the ring patches. >>> >>> *nod* >>> >>> That's the big question, put the data inline (with potential alignment >>> hassles) or manage (and map) a separate data structure. >>> >>> Maybe padding could be inserted to solve alignment? >> >> Right now I have this struct: >> >> struct fuse_ring_req { >> union { >> /* The first 4K are command data */ >> char ring_header[FUSE_RING_HEADER_BUF_SIZE]; >> >> struct { >> uint64_t flags; >> >> /* enum fuse_ring_buf_cmd */ >> uint32_t in_out_arg_len; >> uint32_t padding; >> >> /* kernel fills in, reads out */ >> union { >> struct fuse_in_header in; >> struct fuse_out_header out; >> }; >> }; >> }; >> >> char in_out_arg[]; >> }; >> >> >> Data go into in_out_arg, i.e. headers are padded by the union. >> I actually wonder if FUSE_RING_HEADER_BUF_SIZE should be page size >> and not a fixed 4K. 
> > I would make the commands variable sized, so that commands with no data > buffers don't need padding, and then when you do have a data command you > only pad out that specific command so that the data buffer starts on a > page boundary. The same buffer is used for kernel to userspace and the other way around - it is attached to the ring entry. Either direction will always have data, where would a dynamic sizing then be useful? Well, some "data" like the node id don't need to be aligned - we could save memory for that. I still would like to have some padding so that headers could be grown without any kind of compat issues. Though almost 4K is probably too much for that. Thanks for pointing it out, will improve it! Cheers, Bernd
On 6/12/24 16:56, Bernd Schubert wrote: > On 6/12/24 16:07, Miklos Szeredi wrote: >> On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@ddn.com> wrote: >> >>> I didn't do that yet, as we are going to use the ring buffer for requests, >>> i.e. the ring buffer immediately gets all the data from network, there is >>> no copy. Even if the ring buffer would get data from local disk - there >>> is no need to use a separate application buffer anymore. And with that >>> there is just no extra copy >> >> Let's just tackle this shared request buffer, as it seems to be a >> central part of your design. >> >> You say the shared buffer is used to immediately get the data from the >> network (or various other sources), which is completely viable. >> >> And then the kernel will do the copy from the shared buffer. Single copy, fine. >> >> But if the buffer wasn't shared? What would be the difference? >> Single copy also. >> >> Why is the shared buffer better? I mean it may even be worse due to >> cache aliasing issues on certain architectures. copy_to_user() / >> copy_from_user() are pretty darn efficient. > > Right now we have: > > - Application thread writes into the buffer, then calls io_uring_cmd_done > > I can try to do without mmap and set a pointer to the user buffer in the > 80B section of the SQE. I'm not sure if the application is allowed to > write into that buffer, possibly/probably we will be forced to use > io_uring_cmd_complete_in_task() in all cases (without 19/19 we have that > anyway). My greatest fear here is that the extra task has performance > implications for sync requests. > > >> >> Why is it better to have that buffer managed by kernel? Being locked >> in memory (being unswappable) is probably a disadvantage as well. And >> if locking is required, it can be done on the user buffer. > > Well, let me try to give the buffer in the 80B section. > >> >> And there are all the setup and teardown complexities... > > If the buffer in the 80B section works setup becomes easier, mmap and > ioctls go away. Teardown, well, we still need the workaround as we need > to handle io_uring_cmd_done, but if you could live with that for the > instance, I would ask Jens or Pavel or Ming for help if we could solve > that in io-uring itself. > Is the ring workaround in fuse_dev_release() acceptable for you? Or do > you have any another idea about it? > >> Short update, I have this working for some time now with a hack patch that just adds in a user buffer (without removing mmap, it is just unused). Initially I thought that is a lot slower, but after removing all the kernel debug options perf loss is just around 5% and I think I can get back the remaining by having iov_iter_get_pages2() of the user buffer in the initialization (with additional code overhead). I hope to have new patches by mid of next week. I also want to get rid of the difference of buffer layout between uring and /dev/fuse as that can be troublesome for other changes like alignment. That might require an io-uring CQE128, though. Thanks, Bernd
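For context, getting the 80-byte SQE command area (and, if replies or headers were ever inlined into completions, the large CQEs mentioned above) means the fuse server would create its ring with the big SQE/CQE setup flags; a minimal liburing sketch, with whether CQE128 is actually required still an open question above:

#include <liburing.h>

/* sketch: ring with 128-byte SQEs (80B command area) and 128-byte CQEs */
static int fuse_uring_setup(struct io_uring *ring, unsigned int entries)
{
    struct io_uring_params p = {
        .flags = IORING_SETUP_SQE128 | IORING_SETUP_CQE128,
    };

    return io_uring_queue_init_params(entries, ring, &p);
}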
[I shortened the CC list as that long came up only due to mmap and optimizations] On 6/12/24 16:56, Bernd Schubert wrote: > On 6/12/24 16:07, Miklos Szeredi wrote: >> On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@ddn.com> wrote: >> >>> I didn't do that yet, as we are going to use the ring buffer for requests, >>> i.e. the ring buffer immediately gets all the data from network, there is >>> no copy. Even if the ring buffer would get data from local disk - there >>> is no need to use a separate application buffer anymore. And with that >>> there is just no extra copy >> >> Let's just tackle this shared request buffer, as it seems to be a >> central part of your design. >> >> You say the shared buffer is used to immediately get the data from the >> network (or various other sources), which is completely viable. >> >> And then the kernel will do the copy from the shared buffer. Single copy, fine. >> >> But if the buffer wasn't shared? What would be the difference? >> Single copy also. >> >> Why is the shared buffer better? I mean it may even be worse due to >> cache aliasing issues on certain architectures. copy_to_user() / >> copy_from_user() are pretty darn efficient. > > Right now we have: > > - Application thread writes into the buffer, then calls io_uring_cmd_done > > I can try to do without mmap and set a pointer to the user buffer in the > 80B section of the SQE. I'm not sure if the application is allowed to > write into that buffer, possibly/probably we will be forced to use > io_uring_cmd_complete_in_task() in all cases (without 19/19 we have that > anyway). My greatest fear here is that the extra task has performance > implications for sync requests. > > >> >> Why is it better to have that buffer managed by kernel? Being locked >> in memory (being unswappable) is probably a disadvantage as well. And >> if locking is required, it can be done on the user buffer. > > Well, let me try to give the buffer in the 80B section. > >> >> And there are all the setup and teardown complexities... > > If the buffer in the 80B section works setup becomes easier, mmap and > ioctls go away. Teardown, well, we still need the workaround as we need > to handle io_uring_cmd_done, but if you could live with that for the > instance, I would ask Jens or Pavel or Ming for help if we could solve > that in io-uring itself. > Is the ring workaround in fuse_dev_release() acceptable for you? Or do > you have any another idea about it? > >> >> Note: the ring buffer used by io_uring is different. It literally >> allows communication without invoking any system calls in certain >> cases. That shared buffer doesn't add anything like that. At least I >> don't see what it actually adds. >> >> Hmm? > > The application can write into the buffer. We won't shared queue buffers > if we could solve the same with a user pointer. Wanted to send out a new series today, https://github.com/bsbernd/linux/tree/fuse-uring-for-6.10-rfc3-without-mmap but then just noticed a tear down issue. 
1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7] [ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G O 6.10.0+ #48 [ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 1525.922470] Workqueue: events io_fallback_req_func [ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80 [ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7 [ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002 [ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001 [ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0 [ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 [ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000 [ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001 [ 1525.968225] FS: 0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000 [ 1525.973932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0 [ 1525.980030] Call Trace: [ 1525.981371] <TASK> [ 1525.982567] ? __die_body+0x66/0xb0 [ 1525.984376] ? die_addr+0xc1/0x100 [ 1525.986111] ? exc_general_protection+0x1c6/0x330 [ 1525.988401] ? asm_exc_general_protection+0x22/0x30 [ 1525.990864] ? __lock_acquire+0x74/0x7b80 [ 1525.992901] ? mark_lock+0x9f/0x360 [ 1525.994635] ? __lock_acquire+0x1420/0x7b80 [ 1525.996629] ? attach_entity_load_avg+0x47d/0x550 [ 1525.998765] ? hlock_conflict+0x5a/0x1f0 [ 1526.000515] ? __bfs+0x2dc/0x5a0 [ 1526.001993] lock_acquire+0x1fb/0x3d0 [ 1526.004727] ? gup_fast_fallback+0x13f/0x1d80 [ 1526.006586] ? gup_fast_fallback+0x13f/0x1d80 [ 1526.008412] gup_fast_fallback+0x158/0x1d80 [ 1526.010170] ? gup_fast_fallback+0x13f/0x1d80 [ 1526.011999] ? __lock_acquire+0x2b07/0x7b80 [ 1526.013793] __iov_iter_get_pages_alloc+0x36e/0x980 [ 1526.015876] ? do_raw_spin_unlock+0x5a/0x8a0 [ 1526.017734] iov_iter_get_pages2+0x56/0x70 [ 1526.019491] fuse_copy_fill+0x48e/0x980 [fuse] [ 1526.021400] fuse_copy_args+0x174/0x6a0 [fuse] [ 1526.023199] fuse_uring_prepare_send+0x319/0x6c0 [fuse] [ 1526.025178] fuse_uring_send_req_in_task+0x42/0x100 [fuse] [ 1526.027163] io_fallback_req_func+0xb4/0x170 [ 1526.028737] ? process_scheduled_works+0x75b/0x1160 [ 1526.030445] process_scheduled_works+0x85c/0x1160 [ 1526.032073] worker_thread+0x8ba/0xce0 [ 1526.033388] kthread+0x23e/0x2b0 [ 1526.035404] ? pr_cont_work_flush+0x290/0x290 [ 1526.036958] ? kthread_blkcg+0xa0/0xa0 [ 1526.038321] ret_from_fork+0x30/0x60 [ 1526.039600] ? kthread_blkcg+0xa0/0xa0 [ 1526.040942] ret_from_fork_asm+0x11/0x20 [ 1526.042353] </TASK> We probably need to call iov_iter_get_pages2() immediately on submitting the buffer from fuse server and not only when needed. I had planned to do that as optimization later on, I think it is also needed to avoid io_uring_cmd_complete_in_task(). The part I don't like here is that with mmap we had a complex initialization - but then either it worked or did not. No exceptions at IO time. And run time was just a copy into the buffer. Without mmap initialization is much simpler, but now complexity shifts to IO time. Thanks, Bernd
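A rough kernel-side sketch of that idea - pinning the server buffer already at SQE submission time rather than from task_work at I/O time; the fuse_ring_ent_pages structure is an assumption, while import_ubuf() and iov_iter_get_pages2() are the existing iov_iter helpers (ITER_DEST assumes the kernel fills the buffer):

/* hypothetical per-entry page state, pages[] sized for the maximum buffer */
struct fuse_ring_ent_pages {
    struct page **pages;
    unsigned int nr_pages;
    size_t offset;      /* offset into the first page */
};

static int fuse_uring_pin_buf(struct fuse_ring_ent_pages *ep, void __user *ubuf,
                              size_t len, unsigned int max_pages)
{
    struct iov_iter iter;
    size_t off;
    ssize_t copied;
    unsigned int i;
    int err;

    err = import_ubuf(ITER_DEST, ubuf, len, &iter);
    if (err)
        return err;

    /* pins the pages and returns the number of bytes covered */
    copied = iov_iter_get_pages2(&iter, ep->pages, len, max_pages, &off);
    if (copied < 0)
        return copied;

    ep->offset = off;
    ep->nr_pages = DIV_ROUND_UP(off + copied, PAGE_SIZE);

    if (copied < len) {
        /* partial pin: drop the page references again and fail */
        for (i = 0; i < ep->nr_pages; i++)
            put_page(ep->pages[i]);
        return -EFAULT;
    }
    return 0;
}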
On 8/29/24 4:32 PM, Bernd Schubert wrote: > Wanted to send out a new series today, > > https://github.com/bsbernd/linux/tree/fuse-uring-for-6.10-rfc3-without-mmap > > but then just noticed a tear down issue. > > 1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7] > [ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G O 6.10.0+ #48 > [ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 > [ 1525.922470] Workqueue: events io_fallback_req_func > [ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80 > [ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7 > [ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002 > [ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001 > [ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0 > [ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 > [ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000 > [ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001 > [ 1525.968225] FS: 0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000 > [ 1525.973932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0 > [ 1525.980030] Call Trace: > [ 1525.981371] <TASK> > [ 1525.982567] ? __die_body+0x66/0xb0 > [ 1525.984376] ? die_addr+0xc1/0x100 > [ 1525.986111] ? exc_general_protection+0x1c6/0x330 > [ 1525.988401] ? asm_exc_general_protection+0x22/0x30 > [ 1525.990864] ? __lock_acquire+0x74/0x7b80 > [ 1525.992901] ? mark_lock+0x9f/0x360 > [ 1525.994635] ? __lock_acquire+0x1420/0x7b80 > [ 1525.996629] ? attach_entity_load_avg+0x47d/0x550 > [ 1525.998765] ? hlock_conflict+0x5a/0x1f0 > [ 1526.000515] ? __bfs+0x2dc/0x5a0 > [ 1526.001993] lock_acquire+0x1fb/0x3d0 > [ 1526.004727] ? gup_fast_fallback+0x13f/0x1d80 > [ 1526.006586] ? gup_fast_fallback+0x13f/0x1d80 > [ 1526.008412] gup_fast_fallback+0x158/0x1d80 > [ 1526.010170] ? gup_fast_fallback+0x13f/0x1d80 > [ 1526.011999] ? __lock_acquire+0x2b07/0x7b80 > [ 1526.013793] __iov_iter_get_pages_alloc+0x36e/0x980 > [ 1526.015876] ? do_raw_spin_unlock+0x5a/0x8a0 > [ 1526.017734] iov_iter_get_pages2+0x56/0x70 > [ 1526.019491] fuse_copy_fill+0x48e/0x980 [fuse] > [ 1526.021400] fuse_copy_args+0x174/0x6a0 [fuse] > [ 1526.023199] fuse_uring_prepare_send+0x319/0x6c0 [fuse] > [ 1526.025178] fuse_uring_send_req_in_task+0x42/0x100 [fuse] > [ 1526.027163] io_fallback_req_func+0xb4/0x170 > [ 1526.028737] ? process_scheduled_works+0x75b/0x1160 > [ 1526.030445] process_scheduled_works+0x85c/0x1160 > [ 1526.032073] worker_thread+0x8ba/0xce0 > [ 1526.033388] kthread+0x23e/0x2b0 > [ 1526.035404] ? pr_cont_work_flush+0x290/0x290 > [ 1526.036958] ? kthread_blkcg+0xa0/0xa0 > [ 1526.038321] ret_from_fork+0x30/0x60 > [ 1526.039600] ? kthread_blkcg+0xa0/0xa0 > [ 1526.040942] ret_from_fork_asm+0x11/0x20 > [ 1526.042353] </TASK> > > > We probably need to call iov_iter_get_pages2() immediately > on submitting the buffer from fuse server and not only when needed. > I had planned to do that as optimization later on, I think > it is also needed to avoid io_uring_cmd_complete_in_task(). 
I think you do, but it's not really what's wrong here - fallback work is being invoked as the ring is being torn down, either directly or because the task is exiting. Your task_work should check if this is the case, and just do -ECANCELED for this case rather than attempt to execute the work. Most task_work doesn't do much outside of post a completion, but yours seems complex in that attempts to map pages as well, for example. In any case, regardless of whether you move the gup to the actual issue side of things (which I think you should), then you'd want something ala: if (req->task != current) don't issue, -ECANCELED in your task_work. > The part I don't like here is that with mmap we had a complex > initialization - but then either it worked or did not. No exceptions > at IO time. And run time was just a copy into the buffer. > Without mmap initialization is much simpler, but now complexity shifts > to IO time. I'll take a look at your code. But I'd say just fix the missing check above and send out what you have, it's much easier to iterate on the list rather than poking at patches in some git branch somewhere.
On 8/30/24 15:12, Jens Axboe wrote: > On 8/29/24 4:32 PM, Bernd Schubert wrote: >> Wanted to send out a new series today, >> >> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.10-rfc3-without-mmap >> >> but then just noticed a tear down issue. >> >> 1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7] >> [ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G O 6.10.0+ #48 >> [ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 >> [ 1525.922470] Workqueue: events io_fallback_req_func >> [ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80 >> [ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7 >> [ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002 >> [ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001 >> [ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0 >> [ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 >> [ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000 >> [ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001 >> [ 1525.968225] FS: 0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000 >> [ 1525.973932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0 >> [ 1525.980030] Call Trace: >> [ 1525.981371] <TASK> >> [ 1525.982567] ? __die_body+0x66/0xb0 >> [ 1525.984376] ? die_addr+0xc1/0x100 >> [ 1525.986111] ? exc_general_protection+0x1c6/0x330 >> [ 1525.988401] ? asm_exc_general_protection+0x22/0x30 >> [ 1525.990864] ? __lock_acquire+0x74/0x7b80 >> [ 1525.992901] ? mark_lock+0x9f/0x360 >> [ 1525.994635] ? __lock_acquire+0x1420/0x7b80 >> [ 1525.996629] ? attach_entity_load_avg+0x47d/0x550 >> [ 1525.998765] ? hlock_conflict+0x5a/0x1f0 >> [ 1526.000515] ? __bfs+0x2dc/0x5a0 >> [ 1526.001993] lock_acquire+0x1fb/0x3d0 >> [ 1526.004727] ? gup_fast_fallback+0x13f/0x1d80 >> [ 1526.006586] ? gup_fast_fallback+0x13f/0x1d80 >> [ 1526.008412] gup_fast_fallback+0x158/0x1d80 >> [ 1526.010170] ? gup_fast_fallback+0x13f/0x1d80 >> [ 1526.011999] ? __lock_acquire+0x2b07/0x7b80 >> [ 1526.013793] __iov_iter_get_pages_alloc+0x36e/0x980 >> [ 1526.015876] ? do_raw_spin_unlock+0x5a/0x8a0 >> [ 1526.017734] iov_iter_get_pages2+0x56/0x70 >> [ 1526.019491] fuse_copy_fill+0x48e/0x980 [fuse] >> [ 1526.021400] fuse_copy_args+0x174/0x6a0 [fuse] >> [ 1526.023199] fuse_uring_prepare_send+0x319/0x6c0 [fuse] >> [ 1526.025178] fuse_uring_send_req_in_task+0x42/0x100 [fuse] >> [ 1526.027163] io_fallback_req_func+0xb4/0x170 >> [ 1526.028737] ? process_scheduled_works+0x75b/0x1160 >> [ 1526.030445] process_scheduled_works+0x85c/0x1160 >> [ 1526.032073] worker_thread+0x8ba/0xce0 >> [ 1526.033388] kthread+0x23e/0x2b0 >> [ 1526.035404] ? pr_cont_work_flush+0x290/0x290 >> [ 1526.036958] ? kthread_blkcg+0xa0/0xa0 >> [ 1526.038321] ret_from_fork+0x30/0x60 >> [ 1526.039600] ? kthread_blkcg+0xa0/0xa0 >> [ 1526.040942] ret_from_fork_asm+0x11/0x20 >> [ 1526.042353] </TASK> >> >> >> We probably need to call iov_iter_get_pages2() immediately >> on submitting the buffer from fuse server and not only when needed. >> I had planned to do that as optimization later on, I think >> it is also needed to avoid io_uring_cmd_complete_in_task(). 
> > I think you do, but it's not really what's wrong here - fallback work is > being invoked as the ring is being torn down, either directly or because > the task is exiting. Your task_work should check if this is the case, > and just do -ECANCELED for this case rather than attempt to execute the > work. Most task_work doesn't do much outside of post a completion, but > yours seems complex in that attempts to map pages as well, for example. > In any case, regardless of whether you move the gup to the actual issue > side of things (which I think you should), then you'd want something > ala: > > if (req->task != current) > don't issue, -ECANCELED > > in your task_work. Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong into __io_uring_cmd_do_in_task then? Because my task_work_cb function (passed to io_uring_cmd_complete_in_task) doesn't even have the request. I'm going to test this in a bit diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c index 21ac5fb2d5f0..c06b9fcff48f 100644 --- a/io_uring/uring_cmd.c +++ b/io_uring/uring_cmd.c @@ -120,6 +120,11 @@ static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts) { struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd); + if (req->task != current) { + /* don't issue, -ECANCELED */ + return; + } + /* task_work executor checks the deffered list completion */ ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER); } > >> The part I don't like here is that with mmap we had a complex >> initialization - but then either it worked or did not. No exceptions >> at IO time. And run time was just a copy into the buffer. >> Without mmap initialization is much simpler, but now complexity shifts >> to IO time. > > I'll take a look at your code. But I'd say just fix the missing check > above and send out what you have, it's much easier to iterate on the > list rather than poking at patches in some git branch somewhere. > I'm almost through updating it, will send something out definitely today. I will just keep the last patch that pins user buffer pages on top of the series - will avoid all the rebasing. Thanks, Bernd
On 8/30/24 7:28 AM, Bernd Schubert wrote: > On 8/30/24 15:12, Jens Axboe wrote: >> On 8/29/24 4:32 PM, Bernd Schubert wrote: >>> Wanted to send out a new series today, >>> >>> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.10-rfc3-without-mmap >>> >>> but then just noticed a tear down issue. >>> >>> 1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7] >>> [ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G O 6.10.0+ #48 >>> [ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 >>> [ 1525.922470] Workqueue: events io_fallback_req_func >>> [ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80 >>> [ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7 >>> [ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002 >>> [ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001 >>> [ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0 >>> [ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 >>> [ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000 >>> [ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001 >>> [ 1525.968225] FS: 0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000 >>> [ 1525.973932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0 >>> [ 1525.980030] Call Trace: >>> [ 1525.981371] <TASK> >>> [ 1525.982567] ? __die_body+0x66/0xb0 >>> [ 1525.984376] ? die_addr+0xc1/0x100 >>> [ 1525.986111] ? exc_general_protection+0x1c6/0x330 >>> [ 1525.988401] ? asm_exc_general_protection+0x22/0x30 >>> [ 1525.990864] ? __lock_acquire+0x74/0x7b80 >>> [ 1525.992901] ? mark_lock+0x9f/0x360 >>> [ 1525.994635] ? __lock_acquire+0x1420/0x7b80 >>> [ 1525.996629] ? attach_entity_load_avg+0x47d/0x550 >>> [ 1525.998765] ? hlock_conflict+0x5a/0x1f0 >>> [ 1526.000515] ? __bfs+0x2dc/0x5a0 >>> [ 1526.001993] lock_acquire+0x1fb/0x3d0 >>> [ 1526.004727] ? gup_fast_fallback+0x13f/0x1d80 >>> [ 1526.006586] ? gup_fast_fallback+0x13f/0x1d80 >>> [ 1526.008412] gup_fast_fallback+0x158/0x1d80 >>> [ 1526.010170] ? gup_fast_fallback+0x13f/0x1d80 >>> [ 1526.011999] ? __lock_acquire+0x2b07/0x7b80 >>> [ 1526.013793] __iov_iter_get_pages_alloc+0x36e/0x980 >>> [ 1526.015876] ? do_raw_spin_unlock+0x5a/0x8a0 >>> [ 1526.017734] iov_iter_get_pages2+0x56/0x70 >>> [ 1526.019491] fuse_copy_fill+0x48e/0x980 [fuse] >>> [ 1526.021400] fuse_copy_args+0x174/0x6a0 [fuse] >>> [ 1526.023199] fuse_uring_prepare_send+0x319/0x6c0 [fuse] >>> [ 1526.025178] fuse_uring_send_req_in_task+0x42/0x100 [fuse] >>> [ 1526.027163] io_fallback_req_func+0xb4/0x170 >>> [ 1526.028737] ? process_scheduled_works+0x75b/0x1160 >>> [ 1526.030445] process_scheduled_works+0x85c/0x1160 >>> [ 1526.032073] worker_thread+0x8ba/0xce0 >>> [ 1526.033388] kthread+0x23e/0x2b0 >>> [ 1526.035404] ? pr_cont_work_flush+0x290/0x290 >>> [ 1526.036958] ? kthread_blkcg+0xa0/0xa0 >>> [ 1526.038321] ret_from_fork+0x30/0x60 >>> [ 1526.039600] ? kthread_blkcg+0xa0/0xa0 >>> [ 1526.040942] ret_from_fork_asm+0x11/0x20 >>> [ 1526.042353] </TASK> >>> >>> >>> We probably need to call iov_iter_get_pages2() immediately >>> on submitting the buffer from fuse server and not only when needed. 
>>> I had planned to do that as optimization later on, I think >>> it is also needed to avoid io_uring_cmd_complete_in_task(). >> >> I think you do, but it's not really what's wrong here - fallback work is >> being invoked as the ring is being torn down, either directly or because >> the task is exiting. Your task_work should check if this is the case, >> and just do -ECANCELED for this case rather than attempt to execute the >> work. Most task_work doesn't do much outside of post a completion, but >> yours seems complex in that attempts to map pages as well, for example. >> In any case, regardless of whether you move the gup to the actual issue >> side of things (which I think you should), then you'd want something >> ala: >> >> if (req->task != current) >> don't issue, -ECANCELED >> >> in your task_work. > > Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong > into __io_uring_cmd_do_in_task then? Because my task_work_cb function > (passed to io_uring_cmd_complete_in_task) doesn't even have the request. Yeah it probably does, the uring_cmd case is a bit special is that it's a set of helpers around task_work that can be consumed by eg fuse and ublk. The existing users don't really do anything complicated on that side, hence there's no real need to check. But since the ring/task is going away, we should be able to generically do it in the helpers like you did below. > I'm going to test this in a bit > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > index 21ac5fb2d5f0..c06b9fcff48f 100644 > --- a/io_uring/uring_cmd.c > +++ b/io_uring/uring_cmd.c > @@ -120,6 +120,11 @@ static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts) > { > struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd); > > + if (req->task != current) { > + /* don't issue, -ECANCELED */ > + return; > + } > + > /* task_work executor checks the deffered list completion */ > ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER); > } > > > >> >>> The part I don't like here is that with mmap we had a complex >>> initialization - but then either it worked or did not. No exceptions >>> at IO time. And run time was just a copy into the buffer. >>> Without mmap initialization is much simpler, but now complexity shifts >>> to IO time. >> >> I'll take a look at your code. But I'd say just fix the missing check >> above and send out what you have, it's much easier to iterate on the >> list rather than poking at patches in some git branch somewhere. >> > > I'm almost through updating it, will send something out definitely > today. I will just keep the last patch that pins user buffer pages on > top of the series - will avoid all the rebasing. Excellent! Would be great if you can include how to test it as well, as then I can give it a spin as well and test any potential code changes proposed.
On 8/30/24 14:33, Jens Axboe wrote: > On 8/30/24 7:28 AM, Bernd Schubert wrote: >> On 8/30/24 15:12, Jens Axboe wrote: >>> On 8/29/24 4:32 PM, Bernd Schubert wrote: >>>> We probably need to call iov_iter_get_pages2() immediately >>>> on submitting the buffer from fuse server and not only when needed. >>>> I had planned to do that as optimization later on, I think >>>> it is also needed to avoid io_uring_cmd_complete_in_task(). >>> >>> I think you do, but it's not really what's wrong here - fallback work is >>> being invoked as the ring is being torn down, either directly or because >>> the task is exiting. Your task_work should check if this is the case, >>> and just do -ECANCELED for this case rather than attempt to execute the >>> work. Most task_work doesn't do much outside of post a completion, but >>> yours seems complex in that attempts to map pages as well, for example. >>> In any case, regardless of whether you move the gup to the actual issue >>> side of things (which I think you should), then you'd want something >>> ala: >>> >>> if (req->task != current) >>> don't issue, -ECANCELED >>> >>> in your task_work.nvme_uring_task_cb >> >> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong >> into __io_uring_cmd_do_in_task then? Because my task_work_cb function >> (passed to io_uring_cmd_complete_in_task) doesn't even have the request. > > Yeah it probably does, the uring_cmd case is a bit special is that it's > a set of helpers around task_work that can be consumed by eg fuse and > ublk. The existing users don't really do anything complicated on that > side, hence there's no real need to check. But since the ring/task is > going away, we should be able to generically do it in the helpers like > you did below. That won't work, we should give commands an opportunity to clean up after themselves. I'm pretty sure it will break existing users. For now we can pass a flag to the callback, fuse would need to check it and fail. Compile tested only commit a5b382f150b44476ccfa84cefdb22ce2ceeb12f1 Author: Pavel Begunkov <asml.silence@gmail.com> Date: Fri Aug 30 15:43:32 2024 +0100 io_uring/cmd: let cmds tw know about dying task When the taks that submitted a request is dying, a task work for that request might get run by a kernel thread or even worse by a half dismantled task. We can't just cancel the task work without running the callback as the cmd might need to do some clean up, so pass a flag instead. If set, it's not safe to access any task resources and the callback is expected to cancel the cmd ASAP. 
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index ace7ac056d51..a89abec98832 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -37,6 +37,7 @@ enum io_uring_cmd_flags {
 	/* set when uring wants to cancel a previously issued command */
 	IO_URING_F_CANCEL		= (1 << 11),
 	IO_URING_F_COMPAT		= (1 << 12),
+	IO_URING_F_TASK_DEAD		= (1 << 13),
 };
 
 struct io_zcrx_ifq;
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 8391c7c7c1ec..55bdcb4b63b3 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -119,9 +119,13 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_mark_cancelable);
 static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	unsigned flags = IO_URING_F_COMPLETE_DEFER;
+
+	if (req->task->flags & PF_EXITING)
+		flags |= IO_URING_F_TASK_DEAD;
 
 	/* task_work executor checks the deffered list completion */
-	ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER);
+	ioucmd->task_work_cb(ioucmd, flags);
 }
 
 void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
On 8/30/24 16:55, Pavel Begunkov wrote: > On 8/30/24 14:33, Jens Axboe wrote: >> On 8/30/24 7:28 AM, Bernd Schubert wrote: >>> On 8/30/24 15:12, Jens Axboe wrote: >>>> On 8/29/24 4:32 PM, Bernd Schubert wrote: >>>>> We probably need to call iov_iter_get_pages2() immediately >>>>> on submitting the buffer from fuse server and not only when needed. >>>>> I had planned to do that as optimization later on, I think >>>>> it is also needed to avoid io_uring_cmd_complete_in_task(). >>>> >>>> I think you do, but it's not really what's wrong here - fallback >>>> work is >>>> being invoked as the ring is being torn down, either directly or >>>> because >>>> the task is exiting. Your task_work should check if this is the case, >>>> and just do -ECANCELED for this case rather than attempt to execute the >>>> work. Most task_work doesn't do much outside of post a completion, but >>>> yours seems complex in that attempts to map pages as well, for example. >>>> In any case, regardless of whether you move the gup to the actual issue >>>> side of things (which I think you should), then you'd want something >>>> ala: >>>> >>>> if (req->task != current) >>>> don't issue, -ECANCELED >>>> >>>> in your task_work.nvme_uring_task_cb >>> >>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong >>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function >>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request. >> >> Yeah it probably does, the uring_cmd case is a bit special is that it's >> a set of helpers around task_work that can be consumed by eg fuse and >> ublk. The existing users don't really do anything complicated on that >> side, hence there's no real need to check. But since the ring/task is >> going away, we should be able to generically do it in the helpers like >> you did below. > > That won't work, we should give commands an opportunity to clean up > after themselves. I'm pretty sure it will break existing users. > For now we can pass a flag to the callback, fuse would need to > check it and fail. Compile tested only > > commit a5b382f150b44476ccfa84cefdb22ce2ceeb12f1 > Author: Pavel Begunkov <asml.silence@gmail.com> > Date: Fri Aug 30 15:43:32 2024 +0100 > > io_uring/cmd: let cmds tw know about dying task > When the taks that submitted a request is dying, a task work for > that > request might get run by a kernel thread or even worse by a half > dismantled task. We can't just cancel the task work without running the > callback as the cmd might need to do some clean up, so pass a flag > instead. If set, it's not safe to access any task resources and the > callback is expected to cancel the cmd ASAP. 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> > > diff --git a/include/linux/io_uring_types.h > b/include/linux/io_uring_types.h > index ace7ac056d51..a89abec98832 100644 > --- a/include/linux/io_uring_types.h > +++ b/include/linux/io_uring_types.h > @@ -37,6 +37,7 @@ enum io_uring_cmd_flags { > /* set when uring wants to cancel a previously issued command */ > IO_URING_F_CANCEL = (1 << 11), > IO_URING_F_COMPAT = (1 << 12), > + IO_URING_F_TASK_DEAD = (1 << 13), > }; > > struct io_zcrx_ifq; > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > index 8391c7c7c1ec..55bdcb4b63b3 100644 > --- a/io_uring/uring_cmd.c > +++ b/io_uring/uring_cmd.c > @@ -119,9 +119,13 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_mark_cancelable); > static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state > *ts) > { > struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct > io_uring_cmd); > + unsigned flags = IO_URING_F_COMPLETE_DEFER; > + > + if (req->task->flags & PF_EXITING) > + flags |= IO_URING_F_TASK_DEAD; > > /* task_work executor checks the deffered list completion */ > - ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER); > + ioucmd->task_work_cb(ioucmd, flags); > } > > void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd, > Thanks and yeah you are right, the previous patch would have missed an io_uring_cmd_done for fuse-uring as well. Thanks, Bernd
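To make that point concrete, a hedged sketch of the pattern Pavel's flag
enables, extending the illustrative my_driver_ example from earlier in
the thread (this is not the fuse patch itself, which follows below):
with IO_URING_F_TASK_DEAD the consumer's callback still runs and can
complete the command with io_uring_cmd_done(), whereas an early return
in the generic io_uring_cmd_work() helper would have skipped the
callback and left the command uncompleted.

static void my_driver_task_work_cb(struct io_uring_cmd *cmd,
				   unsigned int issue_flags)
{
	if (issue_flags & IO_URING_F_TASK_DEAD) {
		/* submitter is exiting: don't touch its mm or other resources */
		io_uring_cmd_done(cmd, -ECANCELED, 0, issue_flags);
		return;
	}

	/* normal processing path */
}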
On 8/30/24 8:55 AM, Pavel Begunkov wrote: > On 8/30/24 14:33, Jens Axboe wrote: >> On 8/30/24 7:28 AM, Bernd Schubert wrote: >>> On 8/30/24 15:12, Jens Axboe wrote: >>>> On 8/29/24 4:32 PM, Bernd Schubert wrote: >>>>> We probably need to call iov_iter_get_pages2() immediately >>>>> on submitting the buffer from fuse server and not only when needed. >>>>> I had planned to do that as optimization later on, I think >>>>> it is also needed to avoid io_uring_cmd_complete_in_task(). >>>> >>>> I think you do, but it's not really what's wrong here - fallback work is >>>> being invoked as the ring is being torn down, either directly or because >>>> the task is exiting. Your task_work should check if this is the case, >>>> and just do -ECANCELED for this case rather than attempt to execute the >>>> work. Most task_work doesn't do much outside of post a completion, but >>>> yours seems complex in that attempts to map pages as well, for example. >>>> In any case, regardless of whether you move the gup to the actual issue >>>> side of things (which I think you should), then you'd want something >>>> ala: >>>> >>>> if (req->task != current) >>>> don't issue, -ECANCELED >>>> >>>> in your task_work.nvme_uring_task_cb >>> >>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong >>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function >>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request. >> >> Yeah it probably does, the uring_cmd case is a bit special is that it's >> a set of helpers around task_work that can be consumed by eg fuse and >> ublk. The existing users don't really do anything complicated on that >> side, hence there's no real need to check. But since the ring/task is >> going away, we should be able to generically do it in the helpers like >> you did below. > > That won't work, we should give commands an opportunity to clean up > after themselves. I'm pretty sure it will break existing users. > For now we can pass a flag to the callback, fuse would need to > check it and fail. Compile tested only Right, I did actually consider that yesterday and why I replied with the fuse callback needing to do it, but then forgot... Since we can't do a generic cleanup callback, it'll have to be done in the handler. I do like making this generic and not needing individual task_work handlers like this checking for some magic, so I like the flag addition.
On 8/30/24 22:08, Jens Axboe wrote:
> On 8/30/24 8:55 AM, Pavel Begunkov wrote:
>> On 8/30/24 14:33, Jens Axboe wrote:
>>> On 8/30/24 7:28 AM, Bernd Schubert wrote:
>>>> On 8/30/24 15:12, Jens Axboe wrote:
>>>>> On 8/29/24 4:32 PM, Bernd Schubert wrote:
>>>>>> We probably need to call iov_iter_get_pages2() immediately
>>>>>> on submitting the buffer from fuse server and not only when needed.
>>>>>> I had planned to do that as optimization later on, I think
>>>>>> it is also needed to avoid io_uring_cmd_complete_in_task().
>>>>>
>>>>> I think you do, but it's not really what's wrong here - fallback work is
>>>>> being invoked as the ring is being torn down, either directly or because
>>>>> the task is exiting. Your task_work should check if this is the case,
>>>>> and just do -ECANCELED for this case rather than attempt to execute the
>>>>> work. Most task_work doesn't do much outside of post a completion, but
>>>>> yours seems complex in that attempts to map pages as well, for example.
>>>>> In any case, regardless of whether you move the gup to the actual issue
>>>>> side of things (which I think you should), then you'd want something
>>>>> ala:
>>>>>
>>>>> if (req->task != current)
>>>>> 	don't issue, -ECANCELED
>>>>>
>>>>> in your task_work.nvme_uring_task_cb
>>>>
>>>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong
>>>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function
>>>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request.
>>>
>>> Yeah it probably does, the uring_cmd case is a bit special is that it's
>>> a set of helpers around task_work that can be consumed by eg fuse and
>>> ublk. The existing users don't really do anything complicated on that
>>> side, hence there's no real need to check. But since the ring/task is
>>> going away, we should be able to generically do it in the helpers like
>>> you did below.
>>
>> That won't work, we should give commands an opportunity to clean up
>> after themselves. I'm pretty sure it will break existing users.
>> For now we can pass a flag to the callback, fuse would need to
>> check it and fail. Compile tested only
>
> Right, I did actually consider that yesterday and why I replied with the
> fuse callback needing to do it, but then forgot... Since we can't do a
> generic cleanup callback, it'll have to be done in the handler.
>
> I do like making this generic and not needing individual task_work
> handlers like this checking for some magic, so I like the flag addition.
>

Found another issue (in error handling in my code) while working on page
pinning of the user buffer and fixed that first. Way too late now (or
early) to continue with the page pinning, but I gave Pavel's patch a try
with the additional patch below - same issue.
I added a warn message to see if it triggers - it doesn't come up:

	if (unlikely(issue_flags & IO_URING_F_TASK_DEAD)) {
		pr_warn("IO_URING_F_TASK_DEAD");
		goto terminating;
	}

I could dig further, but I'm actually not sure if we need to. With early
page pinning the entire function should go away, as I hope that the
application can write into the buffer again. Although I'm not sure yet if
Miklos will like that pinning.
bschubert2@imesrv6 linux.git> stg show handle-IO_URING_F_TASK_DEAD

commit 42b4dae795bd37918455bad0ce3eea64b28be03c (HEAD -> fuse-uring-for-6.10-rfc3-without-mmap)
Author: Bernd Schubert <bschubert@ddn.com>
Date:   Sat Aug 31 01:26:26 2024 +0200

    fuse: {uring} Handle IO_URING_F_TASK_DEAD

    The ring task is terminating, it is not safe to still access its
    resources. Also no need for further actions.

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index e557f595133b..1d5dfa9c0965 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1003,6 +1003,9 @@ fuse_uring_send_req_in_task(struct io_uring_cmd *cmd,
 	BUILD_BUG_ON(sizeof(pdu) > sizeof(cmd->pdu));
 	int err;
 
+	if (unlikely(issue_flags & IO_URING_F_TASK_DEAD))
+		goto terminating;
+
 	err = fuse_uring_prepare_send(ring_ent);
 	if (err)
 		goto err;
@@ -1017,6 +1020,10 @@ fuse_uring_send_req_in_task(struct io_uring_cmd *cmd,
 	return;
 err:
 	fuse_uring_next_fuse_req(ring_ent, queue);
+
+terminating:
+	/* Avoid all actions as the task that issues the ring is terminating */
+	io_uring_cmd_done(cmd, -ECANCELED, 0, issue_flags);
 }
 
 /* queue a fuse request and send it if a ring entry is available */

Thanks,
Bernd
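The early pinning Bernd refers to could, for illustration, look roughly
like the sketch below. This is not the code from the series: the
function name, its parameters and the use of import_ubuf() plus
iov_iter_get_pages2() are assumptions made for the example, and whether
plain page references or FOLL_PIN-style pinning is used is exactly the
kind of detail still open in the discussion. The idea is to take the
page references once, when the server registers the buffer with its
SQE, so the task_work path never has to run gup_fast_fallback() at IO
time and the teardown case has nothing user-visible left to do.

#include <linux/uio.h>
#include <linux/mm.h>

/*
 * Sketch only: pin/reference the fuse server's request buffer at
 * registration time. Releasing the page references on error or on
 * ring teardown is omitted here.
 */
static int fuse_uring_pin_user_buf(void __user *ubuf, size_t len,
				   struct page **pages, unsigned int max_pages,
				   size_t *offset)
{
	struct iov_iter iter;
	ssize_t pinned;
	int err;

	/* ITER_DEST: the kernel will later copy request data into this buffer */
	err = import_ubuf(ITER_DEST, ubuf, len, &iter);
	if (err)
		return err;

	pinned = iov_iter_get_pages2(&iter, pages, len, max_pages, offset);
	if (pinned < 0)
		return pinned;
	if ((size_t)pinned < len)
		return -EFAULT;	/* sketch: require the whole buffer to be covered */

	return 0;
}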
On 8/31/24 02:02, Bernd Schubert wrote: > > > On 8/30/24 22:08, Jens Axboe wrote: >> On 8/30/24 8:55 AM, Pavel Begunkov wrote: >>> On 8/30/24 14:33, Jens Axboe wrote: >>>> On 8/30/24 7:28 AM, Bernd Schubert wrote: >>>>> On 8/30/24 15:12, Jens Axboe wrote: >>>>>> On 8/29/24 4:32 PM, Bernd Schubert wrote: >>>>>>> We probably need to call iov_iter_get_pages2() immediately >>>>>>> on submitting the buffer from fuse server and not only when needed. >>>>>>> I had planned to do that as optimization later on, I think >>>>>>> it is also needed to avoid io_uring_cmd_complete_in_task(). >>>>>> >>>>>> I think you do, but it's not really what's wrong here - fallback work is >>>>>> being invoked as the ring is being torn down, either directly or because >>>>>> the task is exiting. Your task_work should check if this is the case, >>>>>> and just do -ECANCELED for this case rather than attempt to execute the >>>>>> work. Most task_work doesn't do much outside of post a completion, but >>>>>> yours seems complex in that attempts to map pages as well, for example. >>>>>> In any case, regardless of whether you move the gup to the actual issue >>>>>> side of things (which I think you should), then you'd want something >>>>>> ala: >>>>>> >>>>>> if (req->task != current) >>>>>> don't issue, -ECANCELED >>>>>> >>>>>> in your task_work.nvme_uring_task_cb >>>>> >>>>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong >>>>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function >>>>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request. >>>> >>>> Yeah it probably does, the uring_cmd case is a bit special is that it's >>>> a set of helpers around task_work that can be consumed by eg fuse and >>>> ublk. The existing users don't really do anything complicated on that >>>> side, hence there's no real need to check. But since the ring/task is >>>> going away, we should be able to generically do it in the helpers like >>>> you did below. >>> >>> That won't work, we should give commands an opportunity to clean up >>> after themselves. I'm pretty sure it will break existing users. >>> For now we can pass a flag to the callback, fuse would need to >>> check it and fail. Compile tested only >> >> Right, I did actually consider that yesterday and why I replied with the >> fuse callback needing to do it, but then forgot... Since we can't do a >> generic cleanup callback, it'll have to be done in the handler. >> >> I do like making this generic and not needing individual task_work >> handlers like this checking for some magic, so I like the flag addition. >> > > Found another issue in (error handling in my code) while working on page > pinning of the user buffer and fixed that first. Ways to late now (or early) > to continue with the page pinning, but I gave Pavels patch a try with the > additional patch below - same issue. > I added a warn message to see if triggers - doesn't come up > > if (unlikely(issue_flags & IO_URING_F_TASK_DEAD)) { > pr_warn("IO_URING_F_TASK_DEAD"); > goto terminating; > } > > > I could digg further, but I'm actually not sure if we need to. With early page pinning > the entire function should go away, as I hope that the application can write into the > buffer again. Although I'm not sure yet if Miklos will like that pinning. Works with page pinning, new series comes once I got some sleep (still need to write the change log). Thanks, Bernd