mbox series

[RFC,v4,00/15] fuse: fuse-over-io-uring

Message ID 20241016-fuse-uring-for-6-10-rfc4-v4-0-9739c753666e@ddn.com (mailing list archive)
Headers show
Series fuse: fuse-over-io-uring | expand

Message

Bernd Schubert Oct. 16, 2024, 12:05 a.m. UTC
This adds support for uring communication between kernel and
userspace daemon using opcode the IORING_OP_URING_CMD. The basic
appraoch was taken from ublk.  The patches are in RFC state,
some major changes are still to be expected.

Motivation for these patches is all to increase fuse performance.
In fuse-over-io-uring requests avoid core switching (application
on core X, processing of fuse server on random core Y) and use
shared memory between kernel and userspace to transfer data.
Similar approaches have been taken by ZUFS and FUSE2, though
not over io-uring, but through ioctl IOs

https://lwn.net/Articles/756625/
https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2

Avoiding cache line bouncing / numa systems was discussed
between Amir and Miklos before and Miklos had posted
part of the private discussion here
https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/

This cache line bouncing should be reduced by these patches, as
a) Switching between kernel and userspace is reduced by 50%,
as the request fetch (by read) and result commit (write) is replaced
by a single and submit and fetch command
b) Submitting via ring can avoid context switches at all.
Note: As of now userspace still needs to transition to the kernel to
wake up the submit the result, though it might be possible to
avoid that as well (for example either with IORING_SETUP_SQPOLL
(basic testing did not show performance advantage for now) or
the task that is submitting fuse requests to the ring could also
poll for results (needs additional work).

I had also noticed waitq wake-up latencies in fuse before
https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/

This spinning approach helped with performance (>40% improvement
for file creates), but due to random server side thread/core utilization
spinning cannot be well controlled in /dev/fuse mode.
With fuse-over-io-uring requests are handled on the same core
(sync requests) or on core+1 (large async requests) and performance
improvements are achieved without spinning.

Splice/zero-copy is not supported yet, Ming Lei is working
on io-uring support for ublk_drv, we can probably also use
that approach for fuse and get better zero copy than splice.
https://lore.kernel.org/io-uring/20240808162438.345884-1-ming.lei@redhat.com/

RFCv1 and RFCv2 have been tested with multiple xfstest runs in a VM
(32 cores) with a kernel that has several debug options
enabled (like KASAN and MSAN). RFCv3 is not that well tested yet.
O_DIRECT is currently not working well with /dev/fuse and
also these patches, a patch has been submitted to fix that (although
the approach is refused)
https://www.spinics.net/lists/linux-fsdevel/msg280028.html

Up the to RFCv2 nice effect in io-uring mode was that xftests run faster
(like generic/522 ~2400s /dev/fuse vs. ~1600s patched), though still
slow as this is with ASAN/leak-detection/etc.
With RFCv3 and removed mmap overall run time as approximately the same,
though some optimizations are removed in RFCv3, like submitting to
the ring from the task that created the fuse request (hence, without
io_uring_cmd_complete_in_task()).

The corresponding libfuse patches are on my uring branch,
but need cleanup for submission - will happen during the next
days.
https://github.com/bsbernd/libfuse/tree/uring

Testing with that libfuse branch is possible by running something
like:

example/passthrough_hp -o allow_other --debug-fuse --nopassthrough \
--uring --uring-per-core-queue --uring-fg-depth=1 --uring-bg-depth=1 \
/scratch/source /scratch/dest

With the --debug-fuse option one should see CQE in the request type,
if requests are received via io-uring:

cqe unique: 4, opcode: GETATTR (3), nodeid: 1, insize: 16, pid: 7060
    unique: 4, result=104

Without the --uring option "cqe" is replaced by the default "dev"

dev unique: 4, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 7117
   unique: 4, success, outsize: 120

TODO list for next RFC version
- make the buffer layout exactly the same as /dev/fuse IO
- different request size - a large ring queue size currently needs
too much memory, even if most of the queue size is needed for small
IOs

Future work
- zero copy

I had run quite some benchmarks with linux-6.2 before LSFMMBPF2023,
which, resulted in some tuning patches (at the end of the
patch series).

Benchmark results (with RFC v1)
=======================================

System used for the benchmark is a 32 core (HyperThreading enabled)
Xeon E5-2650 system. I don't have local disks attached that could do
>5GB/s IOs, for paged and dio results a patched version of passthrough-hp
was used that bypasses final reads/writes.

paged reads
-----------
            128K IO size                      1024K IO size
jobs   /dev/fuse     uring    gain     /dev/fuse    uring   gain
 1        1117        1921    1.72        1902       1942   1.02
 2        2502        3527    1.41        3066       3260   1.06
 4        5052        6125    1.21        5994       6097   1.02
 8        6273       10855    1.73        7101      10491   1.48
16        6373       11320    1.78        7660      11419   1.49
24        6111        9015    1.48        7600       9029   1.19
32        5725        7968    1.39        6986       7961   1.14

dio reads (1024K)
-----------------

jobs   /dev/fuse  uring   gain
1	    2023   3998	  2.42
2	    3375   7950   2.83
4	    3823   15022  3.58
8	    7796   22591  2.77
16	    8520   27864  3.27
24	    8361   20617  2.55
32	    8717   12971  1.55

mmap reads (4K)
---------------
(sequential, I probably should have made it random, sequential exposes
a rather interesting/weird 'optimized' memcpy issue - sequential becomes
reversed order 4K read)
https://lore.kernel.org/linux-fsdevel/aae918da-833f-7ec5-ac8a-115d66d80d0e@fastmail.fm/

jobs  /dev/fuse     uring    gain
1       130          323     2.49
2       219          538     2.46
4       503         1040     2.07
8       1472        2039     1.38
16      2191        3518     1.61
24      2453        4561     1.86
32      2178        5628     2.58

(Results on request, setting MAP_HUGETLB much improves performance
for both, io-uring mode then has a slight advantage only.)

creates/s
----------
threads /dev/fuse     uring   gain
1          3944       10121   2.57
2          8580       24524   2.86
4         16628       44426   2.67
8         46746       56716   1.21
16        79740      102966   1.29
20        80284      119502   1.49

(the gain drop with >=8 cores needs to be investigated)

Jens had done some benchmarks with v3 and noticed only 
25% improvement and half of CPU time usage, but v3
removes several optimizations (like waking the same core
and avoiding task io_uring_cmd_done in extra task context).
These optimizations will be submitted once the core work
is merged.

Remaining TODO list for RFCv3:
--------------------------------
1) Let the ring configure ioctl return information,
like mmap/queue-buf size

Right now libfuse and kernel have lots of duplicated setup code
and any kind of pointer/offset mismatch results in a non-working
ring that is hard to debug - probably better when the kernel does
the calculations and returns that to server side

2) In combination with 1, ring requests should retrieve their
userspace address and length from kernel side instead of
calculating it through the mmaped queue buffer on their own.
(Introduction of FUSE_URING_BUF_ADDR_FETCH)

3) Add log buffer into the ioctl and ring-request

This is to provide better error messages (instead of just
errno)

3) Multiple IO sizes per queue

Small IOs and metadata requests do not need large buffer sizes,
we need multiple IO sizes per queue.

4) FUSE_INTERRUPT handling

These are not handled yet, kernel side is probably not difficult
anymore as ring entries take fuse requests through lists.

TODO:
======

- separate buffer for fuse headers to always handle alignment

Signed-off-by: Bernd Schubert <bschubert@ddn.com>
---
Changes in v4:
- Removal of ioctls, all configuration is done dynamically
  on the arrival of FUSE_URING_REQ_FETCH
- ring entries are not (and cannot be without config ioctls)
  allocated as array of the ring/queue - removal of the tag
  variable. Finding ring entries on FUSE_URING_REQ_COMMIT_AND_FETCH
  is more cumbersome now and needs an almost unused 
  struct fuse_pqueue per fuse_ring_queue and uses the unique
  id of fuse requests.
- No device clones needed for to workaroung hanging mounts
  on fuse-server/daemon termination, handled by IO_URING_F_CANCEL
- Removal of sync/async ring entry types
- Addressed some of Joannes comments, but probably not all
- Only very basic tests run for v3, as more updates should follow quickly.

Changes in v3
- Removed the __wake_on_current_cpu optimization (for now
  as that needs to go through another subsystem/tree) ,
  removing it means a significant performance drop)
- Removed MMAP (Miklos)
- Switched to two IOCTLs, instead of one ioctl that had a field
  for subcommands (ring and queue config) (Miklos)
- The ring entry state is a single state and not a bitmask anymore
  (Josef)
- Addressed several other comments from Josef (I need to go over
  the RFCv2 review again, I'm not sure if everything is addressed
  already)

- Link to v3: https://lore.kernel.org/r/20240901-b4-fuse-uring-rfcv3-without-mmap-v3-0-9207f7391444@ddn.com
- Link to v2: https://lore.kernel.org/all/20240529-fuse-uring-for-6-9-rfc2-out-v1-0-d149476b1d65@ddn.com/
- Link to v1: https://lore.kernel.org/r/20240529-fuse-uring-for-6-9-rfc2-out-v1-0-d149476b1d65@ddn.com

---
Bernd Schubert (14):
      fuse: rename to fuse_dev_end_requests and make non-static
      fuse: Move fuse_get_dev to header file
      fuse: Move request bits
      fuse: Add fuse-io-uring design documentation
      fuse: {uring} Handle SQEs - register commands
      fuse: Make fuse_copy non static
      fuse: Add buffer offset for uring into fuse_copy_state
      fuse: {uring} Add uring sqe commit and fetch support
      fuse: {uring} Handle teardown of ring entries
      fuse: {uring} Add a ring queue and send method
      fuse: {uring} Allow to queue to the ring
      fuse: {uring} Handle IO_URING_F_TASK_DEAD
      fuse: {io-uring} Prevent mount point hang on fuse-server termination
      fuse: enable fuse-over-io-uring

Pavel Begunkov (1):
      io_uring/cmd: let cmds to know about dying task

 Documentation/filesystems/fuse-io-uring.rst |  101 +++
 fs/fuse/Kconfig                             |   12 +
 fs/fuse/Makefile                            |    1 +
 fs/fuse/dev.c                               |  155 ++--
 fs/fuse/dev_uring.c                         | 1130 +++++++++++++++++++++++++++
 fs/fuse/dev_uring_i.h                       |  199 +++++
 fs/fuse/fuse_dev_i.h                        |   64 ++
 fs/fuse/fuse_i.h                            |   14 +
 fs/fuse/inode.c                             |    5 +-
 include/linux/io_uring_types.h              |    1 +
 include/uapi/linux/fuse.h                   |   70 ++
 io_uring/uring_cmd.c                        |    6 +-
 12 files changed, 1706 insertions(+), 52 deletions(-)
---
base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd
change-id: 20241015-fuse-uring-for-6-10-rfc4-61d0fc6851f8

Best regards,

Comments

Bernd Schubert Oct. 16, 2024, 12:08 a.m. UTC | #1
Please note that this is a preview only to show the current status. 
V5 should follow soon to separate the headers into its own buffer. 
I actually hope that this v4 is the last RFC version.


Thanks,
Bernd
David Wei Oct. 21, 2024, 4:06 a.m. UTC | #2
On 2024-10-15 17:05, Bernd Schubert wrote:
[...]
> 
> The corresponding libfuse patches are on my uring branch,
> but need cleanup for submission - will happen during the next
> days.
> https://github.com/bsbernd/libfuse/tree/uring
> 
> Testing with that libfuse branch is possible by running something
> like:
> 
> example/passthrough_hp -o allow_other --debug-fuse --nopassthrough \
> --uring --uring-per-core-queue --uring-fg-depth=1 --uring-bg-depth=1 \
> /scratch/source /scratch/dest
> 
> With the --debug-fuse option one should see CQE in the request type,
> if requests are received via io-uring:
> 
> cqe unique: 4, opcode: GETATTR (3), nodeid: 1, insize: 16, pid: 7060
>     unique: 4, result=104
> 
> Without the --uring option "cqe" is replaced by the default "dev"
> 
> dev unique: 4, opcode: GETATTR (3), nodeid: 1, insize: 56, pid: 7117
>    unique: 4, success, outsize: 120

Hi Bernd, I applied this patchset to io_uring-6.12 branch with some
minor conflicts. I'm running the following command:

$ sudo ./build/example/passthrough_hp -o allow_other --debug-fuse --nopassthrough \
--uring --uring-per-core-queue --uring-fg-depth=1 --uring-bg-depth=1 \
/home/vmuser/scratch/source /home/vmuser/scratch/dest
FUSE library version: 3.17.0
Creating ring per-core-queue=1 sync-depth=1 async-depth=1 arglen=1052672
dev unique: 2, opcode: INIT (26), nodeid: 0, insize: 104, pid: 0
INIT: 7.40
flags=0x73fffffb
max_readahead=0x00020000
   INIT: 7.40
   flags=0x4041f429
   max_readahead=0x00020000
   max_write=0x00100000
   max_background=0
   congestion_threshold=0
   time_gran=1
   unique: 2, success, outsize: 80

I created the source and dest folders which are both empty.

I see the following in dmesg:

[ 2453.197510] uring is disabled
[ 2453.198525] uring is disabled
[ 2453.198749] uring is disabled
...

If I then try to list the directory /home/vmuser/scratch:

$ ls -l /home/vmuser/scratch
ls: cannot access 'dest': Software caused connection abort

And passthrough_hp terminates.

My kconfig:

CONFIG_FUSE_FS=m
CONFIG_FUSE_PASSTHROUGH=y
CONFIG_FUSE_IO_URING=y

I'll look into it next week but, do you see anything obviously wrong?
Bernd Schubert Oct. 21, 2024, 11:47 a.m. UTC | #3
Hi David,

On 10/21/24 06:06, David Wei wrote:
> [You don't often get email from dw@davidwei.uk. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
> 
> On 2024-10-15 17:05, Bernd Schubert wrote:
> [...]
>>

...

> Hi Bernd, I applied this patchset to io_uring-6.12 branch with some
> minor conflicts. I'm running the following command:
> 
> $ sudo ./build/example/passthrough_hp -o allow_other --debug-fuse --nopassthrough \
> --uring --uring-per-core-queue --uring-fg-depth=1 --uring-bg-depth=1 \
> /home/vmuser/scratch/source /home/vmuser/scratch/dest
> FUSE library version: 3.17.0
> Creating ring per-core-queue=1 sync-depth=1 async-depth=1 arglen=1052672
> dev unique: 2, opcode: INIT (26), nodeid: 0, insize: 104, pid: 0
> INIT: 7.40
> flags=0x73fffffb
> max_readahead=0x00020000
>     INIT: 7.40
>     flags=0x4041f429
>     max_readahead=0x00020000
>     max_write=0x00100000
>     max_background=0
>     congestion_threshold=0
>     time_gran=1
>     unique: 2, success, outsize: 80
> 
> I created the source and dest folders which are both empty.
> 
> I see the following in dmesg:
> 
> [ 2453.197510] uring is disabled
> [ 2453.198525] uring is disabled
> [ 2453.198749] uring is disabled
> ...
> 
> If I then try to list the directory /home/vmuser/scratch:
> 
> $ ls -l /home/vmuser/scratch
> ls: cannot access 'dest': Software caused connection abort
> 
> And passthrough_hp terminates.
> 
> My kconfig:
> 
> CONFIG_FUSE_FS=m
> CONFIG_FUSE_PASSTHROUGH=y
> CONFIG_FUSE_IO_URING=y
> 
> I'll look into it next week but, do you see anything obviously wrong?


thanks for testing it! I just pushed a fix to my libfuse branches to
avoid the abort for -EOPNOTSUPP. It will gracefully fall back to
/dev/fuse IO now.

Could you please use the rfcv4 branch, as the plain uring
branch will soon get incompatible updates for rfc5?

https://github.com/bsbernd/libfuse/tree/uring-for-rfcv4


The short answer to let you enable fuse-io-uring:

echo 1 >/sys/module/fuse/parameters/enable_uring


(With that the "uring is disabled" should be fixed.)


The long answer for Miklos and others


IOCTL removal introduced a design issue, as now fuse-client
(kernel) does not know if fuse-server/libfuse wants to set
up io-uring communication.
It is not even possible to forbid FUSE_URING_REQ_FETCH after
FUSE_INIT reply, as io-uring is async. What happens is that
fuse-client (kernel) receives all FUSE_URING_REQ_FETCH commands
only after FUSE_INIT reply. And that although FUSE_URING_REQ_FETCH
is send out from libuse *before* replying to FUSE_INIT.
I had also added a comment for that into the code.

And the other issue is that libfuse now does not know if kernel supports
fuse-io-uring. That has some implications
- libfuse cannot write at start up time a clear error message like
"Kernel does not support fuse-over-io-uring, falling back to /dev/fuse IO"
- In the fallback code path one might want to adjust number of libfuse
/dev/fuse threads if io-uring is not supported - with io-uring typically
one thread might be sufficient - to handle FUSE_INTERRUPT.


My suggestion is that we introduce the new FUSE_URING_REQ_REGISTER (or
replace FUSE_URING_REQ_FETCH with that) and then wait in fuse-server
for completion of that command before sending out FUSE_URING_REQ_FETCH.


Thanks,
Bernd
David Wei Oct. 21, 2024, 8:57 p.m. UTC | #4
On 2024-10-21 04:47, Bernd Schubert wrote:
> Hi David,
> 
> On 10/21/24 06:06, David Wei wrote:
>> [You don't often get email from dw@davidwei.uk. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>>
>> On 2024-10-15 17:05, Bernd Schubert wrote:
>> [...]
>>>
> 
> ...
> 
>> Hi Bernd, I applied this patchset to io_uring-6.12 branch with some
>> minor conflicts. I'm running the following command:
>>
>> $ sudo ./build/example/passthrough_hp -o allow_other --debug-fuse --nopassthrough \
>> --uring --uring-per-core-queue --uring-fg-depth=1 --uring-bg-depth=1 \
>> /home/vmuser/scratch/source /home/vmuser/scratch/dest
>> FUSE library version: 3.17.0
>> Creating ring per-core-queue=1 sync-depth=1 async-depth=1 arglen=1052672
>> dev unique: 2, opcode: INIT (26), nodeid: 0, insize: 104, pid: 0
>> INIT: 7.40
>> flags=0x73fffffb
>> max_readahead=0x00020000
>>     INIT: 7.40
>>     flags=0x4041f429
>>     max_readahead=0x00020000
>>     max_write=0x00100000
>>     max_background=0
>>     congestion_threshold=0
>>     time_gran=1
>>     unique: 2, success, outsize: 80
>>
>> I created the source and dest folders which are both empty.
>>
>> I see the following in dmesg:
>>
>> [ 2453.197510] uring is disabled
>> [ 2453.198525] uring is disabled
>> [ 2453.198749] uring is disabled
>> ...
>>
>> If I then try to list the directory /home/vmuser/scratch:
>>
>> $ ls -l /home/vmuser/scratch
>> ls: cannot access 'dest': Software caused connection abort
>>
>> And passthrough_hp terminates.
>>
>> My kconfig:
>>
>> CONFIG_FUSE_FS=m
>> CONFIG_FUSE_PASSTHROUGH=y
>> CONFIG_FUSE_IO_URING=y
>>
>> I'll look into it next week but, do you see anything obviously wrong?
> 
> 
> thanks for testing it! I just pushed a fix to my libfuse branches to
> avoid the abort for -EOPNOTSUPP. It will gracefully fall back to
> /dev/fuse IO now.
> 
> Could you please use the rfcv4 branch, as the plain uring
> branch will soon get incompatible updates for rfc5?
> 
> https://github.com/bsbernd/libfuse/tree/uring-for-rfcv4
> 
> 
> The short answer to let you enable fuse-io-uring:
> 
> echo 1 >/sys/module/fuse/parameters/enable_uring
> 
> 
> (With that the "uring is disabled" should be fixed.)

Thanks, using this branch fixed the issue and now I can see the dest
folder mirroring that of the source folder. There are two issues I
noticed:

[63490.068211] ---[ end trace 0000000000000000 ]---
[64010.242963] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:330
[64010.243531] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 11057, name: fuse-ring-1
[64010.244092] preempt_count: 1, expected: 0
[64010.244346] RCU nest depth: 0, expected: 0
[64010.244599] 2 locks held by fuse-ring-1/11057:
[64010.244886]  #0: ffff888105db20a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x900/0xd80
[64010.245476]  #1: ffff88810f941818 (&fc->lock){+.+.}-{2:2}, at: fuse_uring_cmd+0x83e/0x1890 [fuse]
[64010.246031] CPU: 1 UID: 0 PID: 11057 Comm: fuse-ring-1 Tainted: G        W          6.11.0-10089-g0d2090ccdbbe #2
[64010.246655] Tainted: [W]=WARN
[64010.246853] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[64010.247542] Call Trace:
[64010.247705]  <TASK>
[64010.247860]  dump_stack_lvl+0xb0/0xd0
[64010.248090]  __might_resched+0x2f8/0x510
[64010.248338]  __kmalloc_cache_noprof+0x2aa/0x390
[64010.248614]  ? lockdep_init_map_type+0x2cb/0x7b0
[64010.248923]  ? fuse_uring_cmd+0xcc2/0x1890 [fuse]
[64010.249215]  fuse_uring_cmd+0xcc2/0x1890 [fuse]
[64010.249506]  io_uring_cmd+0x214/0x500
[64010.249745]  io_issue_sqe+0x588/0x1810
[64010.249999]  ? __pfx_io_issue_sqe+0x10/0x10
[64010.250254]  ? io_alloc_async_data+0x88/0x120
[64010.250516]  ? io_alloc_async_data+0x88/0x120
[64010.250811]  ? io_uring_cmd_prep+0x2eb/0x9f0
[64010.251103]  io_submit_sqes+0x796/0x1f80
[64010.251387]  __do_sys_io_uring_enter+0x90a/0xd80
[64010.251696]  ? do_user_addr_fault+0x26f/0xb60
[64010.251991]  ? __pfx___do_sys_io_uring_enter+0x10/0x10
[64010.252333]  ? __up_read+0x3ba/0x750
[64010.252565]  ? __pfx___up_read+0x10/0x10
[64010.252868]  do_syscall_64+0x68/0x140
[64010.253121]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[64010.253444] RIP: 0033:0x7f03a03fb7af
[64010.253679] Code: 45 0f b6 90 d0 00 00 00 41 8b b8 cc 00 00 00 45 31 c0 41 b9 08 00 00 00 41 83 e2 01 41 c1 e2 04 41 09 c2 b8 aa 01 00 00 0f 05 <c3> a8 02 74 cc f0 48 83 0c 24 00 49 8b 40 20 8b 00 a8 01 74 bc b8
[64010.254801] RSP: 002b:00007f039f3ffd08 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
[64010.255261] RAX: ffffffffffffffda RBX: 0000561ab7c1ced0 RCX: 00007f03a03fb7af
[64010.255695] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000009
[64010.256127] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000008
[64010.256556] R10: 0000000000000000 R11: 0000000000000246 R12: 0000561ab7c1d7a8
[64010.256990] R13: 0000561ab7c1da00 R14: 0000561ab7c1d520 R15: 0000000000000001
[64010.257442]  </TASK>

If I am already in dest when I do the mount using passthrough_hp and
then e.g. ls, it hangs indefinitely even if I kill passthrough_hp.
Bernd Schubert Oct. 22, 2024, 10:24 a.m. UTC | #5
On 10/21/24 22:57, David Wei wrote:
> On 2024-10-21 04:47, Bernd Schubert wrote:
>> Hi David,
>>
>> On 10/21/24 06:06, David Wei wrote:
>>> [You don't often get email from dw@davidwei.uk. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>>>
>>> On 2024-10-15 17:05, Bernd Schubert wrote:
>>> [...]
>>>>
>>
>> ...
>>
>>> Hi Bernd, I applied this patchset to io_uring-6.12 branch with some
>>> minor conflicts. I'm running the following command:
>>>
>>> $ sudo ./build/example/passthrough_hp -o allow_other --debug-fuse --nopassthrough \
>>> --uring --uring-per-core-queue --uring-fg-depth=1 --uring-bg-depth=1 \
>>> /home/vmuser/scratch/source /home/vmuser/scratch/dest
>>> FUSE library version: 3.17.0
>>> Creating ring per-core-queue=1 sync-depth=1 async-depth=1 arglen=1052672
>>> dev unique: 2, opcode: INIT (26), nodeid: 0, insize: 104, pid: 0
>>> INIT: 7.40
>>> flags=0x73fffffb
>>> max_readahead=0x00020000
>>>      INIT: 7.40
>>>      flags=0x4041f429
>>>      max_readahead=0x00020000
>>>      max_write=0x00100000
>>>      max_background=0
>>>      congestion_threshold=0
>>>      time_gran=1
>>>      unique: 2, success, outsize: 80
>>>
>>> I created the source and dest folders which are both empty.
>>>
>>> I see the following in dmesg:
>>>
>>> [ 2453.197510] uring is disabled
>>> [ 2453.198525] uring is disabled
>>> [ 2453.198749] uring is disabled
>>> ...
>>>
>>> If I then try to list the directory /home/vmuser/scratch:
>>>
>>> $ ls -l /home/vmuser/scratch
>>> ls: cannot access 'dest': Software caused connection abort
>>>
>>> And passthrough_hp terminates.
>>>
>>> My kconfig:
>>>
>>> CONFIG_FUSE_FS=m
>>> CONFIG_FUSE_PASSTHROUGH=y
>>> CONFIG_FUSE_IO_URING=y
>>>
>>> I'll look into it next week but, do you see anything obviously wrong?
>>
>>
>> thanks for testing it! I just pushed a fix to my libfuse branches to
>> avoid the abort for -EOPNOTSUPP. It will gracefully fall back to
>> /dev/fuse IO now.
>>
>> Could you please use the rfcv4 branch, as the plain uring
>> branch will soon get incompatible updates for rfc5?
>>
>> https://github.com/bsbernd/libfuse/tree/uring-for-rfcv4
>>
>>
>> The short answer to let you enable fuse-io-uring:
>>
>> echo 1 >/sys/module/fuse/parameters/enable_uring
>>
>>
>> (With that the "uring is disabled" should be fixed.)
> 
> Thanks, using this branch fixed the issue and now I can see the dest
> folder mirroring that of the source folder. There are two issues I
> noticed:
> 
> [63490.068211] ---[ end trace 0000000000000000 ]---
> [64010.242963] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:330
> [64010.243531] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 11057, name: fuse-ring-1
> [64010.244092] preempt_count: 1, expected: 0
> [64010.244346] RCU nest depth: 0, expected: 0
> [64010.244599] 2 locks held by fuse-ring-1/11057:
> [64010.244886]  #0: ffff888105db20a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x900/0xd80
> [64010.245476]  #1: ffff88810f941818 (&fc->lock){+.+.}-{2:2}, at: fuse_uring_cmd+0x83e/0x1890 [fuse]
> [64010.246031] CPU: 1 UID: 0 PID: 11057 Comm: fuse-ring-1 Tainted: G        W          6.11.0-10089-g0d2090ccdbbe #2
> [64010.246655] Tainted: [W]=WARN
> [64010.246853] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> [64010.247542] Call Trace:
> [64010.247705]  <TASK>
> [64010.247860]  dump_stack_lvl+0xb0/0xd0
> [64010.248090]  __might_resched+0x2f8/0x510
> [64010.248338]  __kmalloc_cache_noprof+0x2aa/0x390
> [64010.248614]  ? lockdep_init_map_type+0x2cb/0x7b0
> [64010.248923]  ? fuse_uring_cmd+0xcc2/0x1890 [fuse]
> [64010.249215]  fuse_uring_cmd+0xcc2/0x1890 [fuse]
> [64010.249506]  io_uring_cmd+0x214/0x500
> [64010.249745]  io_issue_sqe+0x588/0x1810
> [64010.249999]  ? __pfx_io_issue_sqe+0x10/0x10
> [64010.250254]  ? io_alloc_async_data+0x88/0x120
> [64010.250516]  ? io_alloc_async_data+0x88/0x120
> [64010.250811]  ? io_uring_cmd_prep+0x2eb/0x9f0
> [64010.251103]  io_submit_sqes+0x796/0x1f80
> [64010.251387]  __do_sys_io_uring_enter+0x90a/0xd80
> [64010.251696]  ? do_user_addr_fault+0x26f/0xb60
> [64010.251991]  ? __pfx___do_sys_io_uring_enter+0x10/0x10
> [64010.252333]  ? __up_read+0x3ba/0x750
> [64010.252565]  ? __pfx___up_read+0x10/0x10
> [64010.252868]  do_syscall_64+0x68/0x140
> [64010.253121]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [64010.253444] RIP: 0033:0x7f03a03fb7af
> [64010.253679] Code: 45 0f b6 90 d0 00 00 00 41 8b b8 cc 00 00 00 45 31 c0 41 b9 08 00 00 00 41 83 e2 01 41 c1 e2 04 41 09 c2 b8 aa 01 00 00 0f 05 <c3> a8 02 74 cc f0 48 83 0c 24 00 49 8b 40 20 8b 00 a8 01 74 bc b8
> [64010.254801] RSP: 002b:00007f039f3ffd08 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
> [64010.255261] RAX: ffffffffffffffda RBX: 0000561ab7c1ced0 RCX: 00007f03a03fb7af
> [64010.255695] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000009
> [64010.256127] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000008
> [64010.256556] R10: 0000000000000000 R11: 0000000000000246 R12: 0000561ab7c1d7a8
> [64010.256990] R13: 0000561ab7c1da00 R14: 0000561ab7c1d520 R15: 0000000000000001
> [64010.257442]  </TASK>

Regarding issue one, does this patch solve it?

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index e518d4379aa1..304919bc12fb 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -168,6 +168,12 @@ static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
         queue = kzalloc(sizeof(*queue), GFP_KERNEL_ACCOUNT);
         if (!queue)
                 return ERR_PTR(-ENOMEM);
+       pq = kcalloc(FUSE_PQ_HASH_SIZE, sizeof(struct list_head), GFP_KERNEL);
+       if (!pq) {
+               kfree(queue);
+               return ERR_PTR(-ENOMEM);
+       }
+
         spin_lock(&fc->lock);
         if (ring->queues[qid]) {
                 spin_unlock(&fc->lock);
@@ -186,11 +192,6 @@ static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
         INIT_LIST_HEAD(&queue->ent_in_userspace);
         INIT_LIST_HEAD(&queue->fuse_req_queue);

-       pq = kcalloc(FUSE_PQ_HASH_SIZE, sizeof(struct list_head), GFP_KERNEL);
-       if (!pq) {
-               kfree(queue);
-               return ERR_PTR(-ENOMEM);
-       }
         queue->fpq.processing = pq;
         fuse_pqueue_init(&queue->fpq);


I think we don't need GFP_ATOMIC, but can do allocations before taking
the lock. This pq allocation is new in v4 and I forgot to put it into
the right place and it slipped through my very basic testing (I'm
concentrating on the design changes for now - testing will come back
with v6).

> 
> If I am already in dest when I do the mount using passthrough_hp and
> then e.g. ls, it hangs indefinitely even if I kill passthrough_hp.

I'm going to check in a bit. I hope it is not a recursion issue.


Thanks,
Bernd
Bernd Schubert Oct. 22, 2024, 12:46 p.m. UTC | #6
On 10/22/24 12:24, Bernd Schubert wrote:
> On 10/21/24 22:57, David Wei wrote:
>> If I am already in dest when I do the mount using passthrough_hp and
>> then e.g. ls, it hangs indefinitely even if I kill passthrough_hp.
> 
> I'm going to check in a bit. I hope it is not a recursion issue.
> 

Hmm, I cannot reproduce this

bernd@squeeze1 dest>pwd
/scratch/dest

bernd@squeeze1 dest>/home/bernd/src/libfuse/github//build-debian/example/passthrough_hp -o allow_other --nopassthrough --uring --uring-per-core-queue --uring-fg-depth=1 --uring-bg-depth=1 /scratch/source /scratch/dest

bernd@squeeze1 dest>ll
total 6.4G
drwxr-xr-x 2 fusetests fusetests 4.0K Jul 30 17:59 scratch_mnt
drwxr-xr-x 2 fusetests fusetests 4.0K Jul 30 17:59 test_dir
-rw-r--r-- 1 bernd     bernd      50G Sep 12 14:20 testfile
-rwxr-xr-x 1 bernd     bernd     6.3G Sep 12 14:39 testfile1


Same when running in foreground and doing operations from another console


cqe unique: 4, opcode: GETATTR (3), nodeid: 1, insize: 16, pid: 732
     unique: 4, result=104
cqe unique: 6, opcode: STATFS (17), nodeid: 1, insize: 0, pid: 732
     unique: 6, result=80


In order to check it is not a recursion issue I also switched my VM to
one core - still no issue. What is your setup?
Also, I'm still on 6.10, I want to send out v5 with separated headers
later this week and next week v6 (and maybe without RFC) for 6.12 next
week.


Thanks,
Bernd
David Wei Oct. 22, 2024, 5:10 p.m. UTC | #7
On 2024-10-22 05:46, Bernd Schubert wrote:
> 
> 
> On 10/22/24 12:24, Bernd Schubert wrote:
>> On 10/21/24 22:57, David Wei wrote:
>>> If I am already in dest when I do the mount using passthrough_hp and
>>> then e.g. ls, it hangs indefinitely even if I kill passthrough_hp.
>>
>> I'm going to check in a bit. I hope it is not a recursion issue.
>>
> 
> Hmm, I cannot reproduce this
> 
> bernd@squeeze1 dest>pwd
> /scratch/dest
> 
> bernd@squeeze1 dest>/home/bernd/src/libfuse/github//build-debian/example/passthrough_hp -o allow_other --nopassthrough --uring --uring-per-core-queue --uring-fg-depth=1 --uring-bg-depth=1 /scratch/source /scratch/dest
> 
> bernd@squeeze1 dest>ll
> total 6.4G
> drwxr-xr-x 2 fusetests fusetests 4.0K Jul 30 17:59 scratch_mnt
> drwxr-xr-x 2 fusetests fusetests 4.0K Jul 30 17:59 test_dir
> -rw-r--r-- 1 bernd     bernd      50G Sep 12 14:20 testfile
> -rwxr-xr-x 1 bernd     bernd     6.3G Sep 12 14:39 testfile1
> 
> 
> Same when running in foreground and doing operations from another console
> 
> 
> cqe unique: 4, opcode: GETATTR (3), nodeid: 1, insize: 16, pid: 732
>     unique: 4, result=104
> cqe unique: 6, opcode: STATFS (17), nodeid: 1, insize: 0, pid: 732
>     unique: 6, result=80
> 
> 
> In order to check it is not a recursion issue I also switched my VM to
> one core - still no issue. What is your setup?
> Also, I'm still on 6.10, I want to send out v5 with separated headers
> later this week and next week v6 (and maybe without RFC) for 6.12 next
> week.

I tried this again and could not repro anymore. I think your latest
libfuse that falls back to /dev/fuse fixed it. Sorry for the noise!

> 
> 
> Thanks,
> Bernd
David Wei Oct. 22, 2024, 5:12 p.m. UTC | #8
On 2024-10-22 03:24, Bernd Schubert wrote:
> 
> 
> On 10/21/24 22:57, David Wei wrote:
>> On 2024-10-21 04:47, Bernd Schubert wrote:
>>> Hi David,
>>>
>>> On 10/21/24 06:06, David Wei wrote:
>>>> [You don't often get email from dw@davidwei.uk. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ]
>>>>
>>>> On 2024-10-15 17:05, Bernd Schubert wrote:
>>>> [...]
>>>>>
>>>
>>> ...
>>>
>>>> Hi Bernd, I applied this patchset to io_uring-6.12 branch with some
>>>> minor conflicts. I'm running the following command:
>>>>
>>>> $ sudo ./build/example/passthrough_hp -o allow_other --debug-fuse --nopassthrough \
>>>> --uring --uring-per-core-queue --uring-fg-depth=1 --uring-bg-depth=1 \
>>>> /home/vmuser/scratch/source /home/vmuser/scratch/dest
>>>> FUSE library version: 3.17.0
>>>> Creating ring per-core-queue=1 sync-depth=1 async-depth=1 arglen=1052672
>>>> dev unique: 2, opcode: INIT (26), nodeid: 0, insize: 104, pid: 0
>>>> INIT: 7.40
>>>> flags=0x73fffffb
>>>> max_readahead=0x00020000
>>>>      INIT: 7.40
>>>>      flags=0x4041f429
>>>>      max_readahead=0x00020000
>>>>      max_write=0x00100000
>>>>      max_background=0
>>>>      congestion_threshold=0
>>>>      time_gran=1
>>>>      unique: 2, success, outsize: 80
>>>>
>>>> I created the source and dest folders which are both empty.
>>>>
>>>> I see the following in dmesg:
>>>>
>>>> [ 2453.197510] uring is disabled
>>>> [ 2453.198525] uring is disabled
>>>> [ 2453.198749] uring is disabled
>>>> ...
>>>>
>>>> If I then try to list the directory /home/vmuser/scratch:
>>>>
>>>> $ ls -l /home/vmuser/scratch
>>>> ls: cannot access 'dest': Software caused connection abort
>>>>
>>>> And passthrough_hp terminates.
>>>>
>>>> My kconfig:
>>>>
>>>> CONFIG_FUSE_FS=m
>>>> CONFIG_FUSE_PASSTHROUGH=y
>>>> CONFIG_FUSE_IO_URING=y
>>>>
>>>> I'll look into it next week but, do you see anything obviously wrong?
>>>
>>>
>>> thanks for testing it! I just pushed a fix to my libfuse branches to
>>> avoid the abort for -EOPNOTSUPP. It will gracefully fall back to
>>> /dev/fuse IO now.
>>>
>>> Could you please use the rfcv4 branch, as the plain uring
>>> branch will soon get incompatible updates for rfc5?
>>>
>>> https://github.com/bsbernd/libfuse/tree/uring-for-rfcv4
>>>
>>>
>>> The short answer to let you enable fuse-io-uring:
>>>
>>> echo 1 >/sys/module/fuse/parameters/enable_uring
>>>
>>>
>>> (With that the "uring is disabled" should be fixed.)
>>
>> Thanks, using this branch fixed the issue and now I can see the dest
>> folder mirroring that of the source folder. There are two issues I
>> noticed:
>>
>> [63490.068211] ---[ end trace 0000000000000000 ]---
>> [64010.242963] BUG: sleeping function called from invalid context at include/linux/sched/mm.h:330
>> [64010.243531] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 11057, name: fuse-ring-1
>> [64010.244092] preempt_count: 1, expected: 0
>> [64010.244346] RCU nest depth: 0, expected: 0
>> [64010.244599] 2 locks held by fuse-ring-1/11057:
>> [64010.244886]  #0: ffff888105db20a8 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x900/0xd80
>> [64010.245476]  #1: ffff88810f941818 (&fc->lock){+.+.}-{2:2}, at: fuse_uring_cmd+0x83e/0x1890 [fuse]
>> [64010.246031] CPU: 1 UID: 0 PID: 11057 Comm: fuse-ring-1 Tainted: G        W          6.11.0-10089-g0d2090ccdbbe #2
>> [64010.246655] Tainted: [W]=WARN
>> [64010.246853] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
>> [64010.247542] Call Trace:
>> [64010.247705]  <TASK>
>> [64010.247860]  dump_stack_lvl+0xb0/0xd0
>> [64010.248090]  __might_resched+0x2f8/0x510
>> [64010.248338]  __kmalloc_cache_noprof+0x2aa/0x390
>> [64010.248614]  ? lockdep_init_map_type+0x2cb/0x7b0
>> [64010.248923]  ? fuse_uring_cmd+0xcc2/0x1890 [fuse]
>> [64010.249215]  fuse_uring_cmd+0xcc2/0x1890 [fuse]
>> [64010.249506]  io_uring_cmd+0x214/0x500
>> [64010.249745]  io_issue_sqe+0x588/0x1810
>> [64010.249999]  ? __pfx_io_issue_sqe+0x10/0x10
>> [64010.250254]  ? io_alloc_async_data+0x88/0x120
>> [64010.250516]  ? io_alloc_async_data+0x88/0x120
>> [64010.250811]  ? io_uring_cmd_prep+0x2eb/0x9f0
>> [64010.251103]  io_submit_sqes+0x796/0x1f80
>> [64010.251387]  __do_sys_io_uring_enter+0x90a/0xd80
>> [64010.251696]  ? do_user_addr_fault+0x26f/0xb60
>> [64010.251991]  ? __pfx___do_sys_io_uring_enter+0x10/0x10
>> [64010.252333]  ? __up_read+0x3ba/0x750
>> [64010.252565]  ? __pfx___up_read+0x10/0x10
>> [64010.252868]  do_syscall_64+0x68/0x140
>> [64010.253121]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> [64010.253444] RIP: 0033:0x7f03a03fb7af
>> [64010.253679] Code: 45 0f b6 90 d0 00 00 00 41 8b b8 cc 00 00 00 45 31 c0 41 b9 08 00 00 00 41 83 e2 01 41 c1 e2 04 41 09 c2 b8 aa 01 00 00 0f 05 <c3> a8 02 74 cc f0 48 83 0c 24 00 49 8b 40 20 8b 00 a8 01 74 bc b8
>> [64010.254801] RSP: 002b:00007f039f3ffd08 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
>> [64010.255261] RAX: ffffffffffffffda RBX: 0000561ab7c1ced0 RCX: 00007f03a03fb7af
>> [64010.255695] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000009
>> [64010.256127] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000008
>> [64010.256556] R10: 0000000000000000 R11: 0000000000000246 R12: 0000561ab7c1d7a8
>> [64010.256990] R13: 0000561ab7c1da00 R14: 0000561ab7c1d520 R15: 0000000000000001
>> [64010.257442]  </TASK>
> 
> Regarding issue one, does this patch solve it?
> 
> diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
> index e518d4379aa1..304919bc12fb 100644
> --- a/fs/fuse/dev_uring.c
> +++ b/fs/fuse/dev_uring.c
> @@ -168,6 +168,12 @@ static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
>         queue = kzalloc(sizeof(*queue), GFP_KERNEL_ACCOUNT);
>         if (!queue)
>                 return ERR_PTR(-ENOMEM);
> +       pq = kcalloc(FUSE_PQ_HASH_SIZE, sizeof(struct list_head), GFP_KERNEL);
> +       if (!pq) {
> +               kfree(queue);
> +               return ERR_PTR(-ENOMEM);
> +       }
> +
>         spin_lock(&fc->lock);
>         if (ring->queues[qid]) {
>                 spin_unlock(&fc->lock);
> @@ -186,11 +192,6 @@ static struct fuse_ring_queue *fuse_uring_create_queue(struct fuse_ring *ring,
>         INIT_LIST_HEAD(&queue->ent_in_userspace);
>         INIT_LIST_HEAD(&queue->fuse_req_queue);
> 
> -       pq = kcalloc(FUSE_PQ_HASH_SIZE, sizeof(struct list_head), GFP_KERNEL);
> -       if (!pq) {
> -               kfree(queue);
> -               return ERR_PTR(-ENOMEM);
> -       }
>         queue->fpq.processing = pq;
>         fuse_pqueue_init(&queue->fpq);
> 
> 
> I think we don't need GFP_ATOMIC, but can do allocations before taking
> the lock. This pq allocation is new in v4 and I forgot to put it into
> the right place and it slipped through my very basic testing (I'm
> concentrating on the design changes for now - testing will come back
> with v6).

Thanks, this patch fixed it for me.

> 
>>
>> If I am already in dest when I do the mount using passthrough_hp and
>> then e.g. ls, it hangs indefinitely even if I kill passthrough_hp.
> 
> I'm going to check in a bit. I hope it is not a recursion issue.
> 
> 
> Thanks,
> Bernd
David Wei Oct. 22, 2024, 10:10 p.m. UTC | #9
On 2024-10-15 17:05, Bernd Schubert wrote:
> RFCv1 and RFCv2 have been tested with multiple xfstest runs in a VM
> (32 cores) with a kernel that has several debug options
> enabled (like KASAN and MSAN). RFCv3 is not that well tested yet.
> O_DIRECT is currently not working well with /dev/fuse and
> also these patches, a patch has been submitted to fix that (although
> the approach is refused)
> https://www.spinics.net/lists/linux-fsdevel/msg280028.html

Hi Bernd, I applied this patch and the associated libfuse patch at:

https://github.com/bsbernd/libfuse/tree/aligned-writes

I have a simple Python FUSE client that is still returning EINVAL for
write():

with open(sys.argv[1], 'r+b') as f:
    mmapped_file = mmap.mmap(f.fileno(), 0)
    shm = shared_memory.SharedMemory(create=True, size=mmapped_file.size())
    shm.buf[:mmapped_file.size()] = mmapped_file[:]
    fd = os.open("/home/vmuser/scratch/dest/out", O_RDWR|O_CREAT|O_DIRECT)
    with open(fd, 'w+b') as f2:
        f2.write(bytes(shm.buf))
    mmapped_file.close()
    shm.unlink()
    shm.close()

I'll keep looking at this but letting you know in case it's something
obvious again.