Message ID | 20240529-fuse-uring-for-6-9-rfc2-out-v1-0-d149476b1d65@ddn.com (mailing list archive) |
---|---|
Series | fuse: fuse-over-io-uring |
On Wed, May 29, 2024 at 9:01 PM Bernd Schubert <bschubert@ddn.com> wrote: > > From: Bernd Schubert <bschubert@ddn.com> > > This adds support for uring communication between kernel and > userspace daemon using the IORING_OP_URING_CMD opcode. The basic > approach was taken from ublk. The patches are in RFC state, > some major changes are still to be expected. > > Motivation for these patches is to increase fuse performance. > In fuse-over-io-uring requests avoid core switching (application > on core X, processing of fuse server on random core Y) and use > shared memory between kernel and userspace to transfer data. > Similar approaches have been taken by ZUFS and FUSE2, though > not over io-uring, but through ioctl IOs > > https://lwn.net/Articles/756625/ > https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2 > > Avoiding cache line bouncing / NUMA issues was discussed > between Amir and Miklos before and Miklos had posted > part of the private discussion here > https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/ > > This cache line bouncing should be addressed by these patches > as well. > > I had also noticed waitq wake-up latencies in fuse before > https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/ > > This spinning approach helped with performance (>40% improvement > for file creates), but due to random server side thread/core utilization > spinning cannot be well controlled in /dev/fuse mode. > With fuse-over-io-uring requests are handled on the same core > (sync requests) or on core+1 (large async requests) and performance > improvements are achieved without spinning. > > Splice/zero-copy is not supported yet, Ming Lei is working > on io-uring support for ublk_drv, but I think so far there > is no final agreement on the approach to be taken. > Fuse-over-io-uring runs significantly faster than reads/writes > over /dev/fuse, even with splice enabled, so missing zc > should not be a blocking issue. > > The patches have been tested with multiple xfstest runs in a VM > (32 cores) with a kernel that has several debug options > enabled (like KASAN and MSAN). > For some tests xfstests reports that O_DIRECT is not supported, > I need to investigate that. The interesting part is that exactly > these tests fail in plain /dev/fuse posix mode. I had to disable > generic/650, which is enabling/disabling cpu cores - given ring > threads are bound to cores, issues with that are not totally > unexpected, but then there are (scheduler) kernel messages that > core binding for these threads is removed - this needs > to be further investigated. > A nice effect in io-uring mode is that tests run faster (like > generic/522 ~2400s /dev/fuse vs. ~1600s patched), though still > slow as this is with ASAN/leak-detection/etc. > > The corresponding libfuse patches are on my uring branch, > but need cleanup for submission - that will happen during the next > days. > https://github.com/bsbernd/libfuse/tree/uring > > If it makes review easier, the patches posted here are on > this branch > https://github.com/bsbernd/linux/tree/fuse-uring-for-6.9-rfc2 > > TODO list for next RFC versions > - Let the ring configure ioctl return information, like mmap/queue-buf size > - Request kernel side address and len for a request - avoid calculation in userspace? > - multiple IO sizes per queue (avoiding a calculation in userspace is probably even > more important) > - FUSE_INTERRUPT handling? 
> - Logging (adds fields in the ioctl and also ring-request), > any mismatch between client and server is currently very hard to understand > through error codes > > Future work > - notifications, probably on their own ring > - zero copy > > I had run quite some benchmarks with linux-6.2 before LSFMMBPF2023, > which resulted in some tuning patches (at the end of the > patch series). >
> Some benchmark results
> ======================
>
> System used for the benchmark is a 32 core (HyperThreading enabled)
> Xeon E5-2650 system. I don't have local disks attached that could do
> >5GB/s IOs, for paged and dio results a patched version of passthrough-hp
> was used that bypasses final reads/writes.
>
> paged reads
> -----------
>                128K IO size               1024K IO size
> jobs   /dev/fuse   uring   gain    /dev/fuse   uring   gain
>  1        1117      1921   1.72       1902      1942   1.02
>  2        2502      3527   1.41       3066      3260   1.06
>  4        5052      6125   1.21       5994      6097   1.02
>  8        6273     10855   1.73       7101     10491   1.48
> 16        6373     11320   1.78       7660     11419   1.49
> 24        6111      9015   1.48       7600      9029   1.19
> 32        5725      7968   1.39       6986      7961   1.14
>
> dio reads (1024K)
> -----------------
>
> jobs   /dev/fuse   uring   gain
>  1        2023      3998   2.42
>  2        3375      7950   2.83
>  4        3823     15022   3.58
>  8        7796     22591   2.77
> 16        8520     27864   3.27
> 24        8361     20617   2.55
> 32        8717     12971   1.55
>
> mmap reads (4K)
> ---------------
> (sequential, I probably should have made it random, sequential exposes
> a rather interesting/weird 'optimized' memcpy issue - sequential becomes
> reversed order 4K read)
> https://lore.kernel.org/linux-fsdevel/aae918da-833f-7ec5-ac8a-115d66d80d0e@fastmail.fm/
>
> jobs   /dev/fuse   uring   gain
>  1         130       323   2.49
>  2         219       538   2.46
>  4         503      1040   2.07
>  8        1472      2039   1.38
> 16        2191      3518   1.61
> 24        2453      4561   1.86
> 32        2178      5628   2.58
>
> (Results on request, setting MAP_HUGETLB much improves performance
> for both, io-uring mode then has a slight advantage only.)
>
> creates/s
> ----------
> threads   /dev/fuse    uring   gain
>  1            3944    10121   2.57
>  2            8580    24524   2.86
>  4           16628    44426   2.67
>  8           46746    56716   1.21
> 16           79740   102966   1.29
> 20           80284   119502   1.49
>
> (the gain drop with >=8 cores needs to be investigated)

Hi Bernd, Those are impressive results! When approaching the FUSE uring feature from a marketing POV, I think that putting the emphasis on metadata operations is the best approach. Not that dio reads are not important (I know that is part of your use case), but I imagine there are a lot more people out there waiting for improvement in metadata operations overhead. To me it helps to know what the current main pain points are for people using FUSE filesystems wrt performance. Although it may not be up to date, the most comprehensive study about FUSE performance overhead is this FAST17 paper: https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf In this paper, table 3 summarizes the different overheads observed per workload. According to this table, the workloads that degrade performance the most on an optimized passthrough fs over SSD are: - many file creates - many file deletes - many small file reads In all these workloads, it was millions of files over many directories. The highest performance regression reported was -83% on many small file creations. The moral of this long story is that it would be nice to know what performance improvement FUSE uring can aspire to. This is especially relevant for people that would be interested in combining the benefits of FUSE passthrough (for data) and FUSE uring (for metadata). What did passthrough_hp do in your patched version with creates? 
Did it actually create the files? In how many directories? Maybe the directory inode lock impeded performance improvement with >=8 threads? > > Remaining TODO list for RFCv3: > -------------------------------- > 1) Let the ring configure ioctl return information, > like mmap/queue-buf size > > Right now libfuse and kernel have lots of duplicated setup code > and any kind of pointer/offset mismatch results in a non-working > ring that is hard to debug - probably better when the kernel does > the calculations and returns that to server side > > 2) In combination with 1, ring requests should retrieve their > userspace address and length from kernel side instead of > calculating it through the mmaped queue buffer on their own. > (Introduction of FUSE_URING_BUF_ADDR_FETCH) > > 3) Add log buffer into the ioctl and ring-request > > This is to provide better error messages (instead of just > errno) > > 3) Multiple IO sizes per queue > > Small IOs and metadata requests do not need large buffer sizes, > we need multiple IO sizes per queue. > > 4) FUSE_INTERRUPT handling > > These are not handled yet, kernel side is probably not difficult > anymore as ring entries take fuse requests through lists. > > Long term TODO: > -------------- > Notifications through io-uring, maybe with a separated ring, > but I'm not sure yet. Is that going to improve performance in any real life workload? Thanks, Amir. > > Changes since RFCv1 > ------------------- > - No need to hold the task of the server side anymore. Also no > ioctls/threads waiting for shutdown anymore. Shutdown now more > works like the traditional fuse way. > - Each queue clones the fuse and device release makes an exception > for io-uring. Reason is that queued IORING_OP_URING_CMD > (through .uring_cmd) prevent a device release. I.e. a killed > server side typically triggers fuse_abort_conn(). This was the > reason for the async stop-monitor in v1 and reference on the daemon > task. However it was very racy and annotated immediately by Miklos. > - In v1 the offset parameter to mmap was identifying the QID, in v2 > server side is expected to send mmap from a core bound ring thread > in numa mode and numa node is taken through the core of that thread. > Kernel side of the mmap buffer is stored in an rbtree and assigned > to the right qid through an additional queue ioctl. > - Release of IORING_OP_URING_CMD is done through lists now, instead > of iterating over the entire array of queues/entries and does not > depend on the entry state anymore (a bit of the state is still left > for sanity check). 
> - Finding free ring queue entries is done through lists and not through > a bitmap anymore > - Many other code changes and bug fixes > - Performance tunings > > --- > Bernd Schubert (19): > fuse: rename to fuse_dev_end_requests and make non-static > fuse: Move fuse_get_dev to header file > fuse: Move request bits > fuse: Add fuse-io-uring design documentation > fuse: Add a uring config ioctl > Add a vmalloc_node_user function > fuse uring: Add an mmap method > fuse: Add the queue configuration ioctl > fuse: {uring} Add a dev_release exception for fuse-over-io-uring > fuse: {uring} Handle SQEs - register commands > fuse: Add support to copy from/to the ring buffer > fuse: {uring} Add uring sqe commit and fetch support > fuse: {uring} Handle uring shutdown > fuse: {uring} Allow to queue to the ring > export __wake_on_current_cpu > fuse: {uring} Wake requests on the the current cpu > fuse: {uring} Send async requests to qid of core + 1 > fuse: {uring} Set a min cpu offset io-size for reads/writes > fuse: {uring} Optimize async sends > > Documentation/filesystems/fuse-io-uring.rst | 167 ++++ > fs/fuse/Kconfig | 12 + > fs/fuse/Makefile | 1 + > fs/fuse/dev.c | 310 +++++-- > fs/fuse/dev_uring.c | 1232 +++++++++++++++++++++++++++ > fs/fuse/dev_uring_i.h | 395 +++++++++ > fs/fuse/file.c | 15 +- > fs/fuse/fuse_dev_i.h | 67 ++ > fs/fuse/fuse_i.h | 9 + > fs/fuse/inode.c | 3 + > include/linux/vmalloc.h | 1 + > include/uapi/linux/fuse.h | 135 +++ > kernel/sched/wait.c | 1 + > mm/nommu.c | 6 + > mm/vmalloc.c | 41 +- > 15 files changed, 2330 insertions(+), 65 deletions(-) > --- > base-commit: dd5a440a31fae6e459c0d6271dddd62825505361 > change-id: 20240529-fuse-uring-for-6-9-rfc2-out-f0a009005fdf > > Best regards, > -- > Bernd Schubert <bschubert@ddn.com> >
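The mechanism the cover letter describes - a server-side thread submitting an IORING_OP_URING_CMD SQE that completes once the kernel has a fuse request for it - can be illustrated with a minimal userspace sketch, loosely following the ublk pattern referred to above. Only IORING_OP_URING_CMD and the liburing calls below are real; FUSE_URING_REQ_FETCH, struct fuse_uring_cmd_req and queue_fetch_sqe() are hypothetical placeholders, not the names the patchset actually defines.

/*
 * Minimal sketch: register one ring entry with the kernel via
 * IORING_OP_URING_CMD.  The command opcode and payload layout are
 * hypothetical; the real definitions live in the patchset's
 * include/uapi/linux/fuse.h.  Needs reasonably recent io_uring
 * uapi headers for the cmd_op/cmd SQE fields.
 */
#include <liburing.h>
#include <string.h>
#include <errno.h>

struct fuse_uring_cmd_req {		/* hypothetical payload layout */
	__u64	commit_id;
	__u16	qid;
	__u16	flags;
};

#define FUSE_URING_REQ_FETCH	1	/* hypothetical command opcode */

static int queue_fetch_sqe(struct io_uring *ring, int fuse_dev_fd,
			   unsigned int qid)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct fuse_uring_cmd_req *req;

	if (!sqe)
		return -EAGAIN;

	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_URING_CMD;	/* real opcode since 5.19 */
	sqe->fd = fuse_dev_fd;			/* cloned /dev/fuse fd */
	sqe->cmd_op = FUSE_URING_REQ_FETCH;

	/* the inline command payload lives in the SQE itself */
	req = (struct fuse_uring_cmd_req *)sqe->cmd;
	req->qid = qid;
	req->commit_id = 0;

	/* the CQE for this SQE arrives once a fuse request is queued */
	return io_uring_submit(ring);
}

The request payload itself would then be read from and written to the mmap'ed per-queue buffer the series sets up, which is where the shared-memory transfer described in the cover letter comes in.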
On 5/30/24 09:07, Amir Goldstein wrote: > On Wed, May 29, 2024 at 9:01 PM Bernd Schubert <bschubert@ddn.com> wrote: >> >> From: Bernd Schubert <bschubert@ddn.com> >> >> This adds support for uring communication between kernel and >> userspace daemon using opcode the IORING_OP_URING_CMD. The basic >> appraoch was taken from ublk. The patches are in RFC state, >> some major changes are still to be expected. >> >> Motivation for these patches is all to increase fuse performance. >> In fuse-over-io-uring requests avoid core switching (application >> on core X, processing of fuse server on random core Y) and use >> shared memory between kernel and userspace to transfer data. >> Similar approaches have been taken by ZUFS and FUSE2, though >> not over io-uring, but through ioctl IOs >> >> https://lwn.net/Articles/756625/ >> https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git/log/?h=fuse2 >> >> Avoiding cache line bouncing / numa systems was discussed >> between Amir and Miklos before and Miklos had posted >> part of the private discussion here >> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/ >> >> This cache line bouncing should be addressed by these patches >> as well. >> >> I had also noticed waitq wake-up latencies in fuse before >> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/ >> >> This spinning approach helped with performance (>40% improvement >> for file creates), but due to random server side thread/core utilization >> spinning cannot be well controlled in /dev/fuse mode. >> With fuse-over-io-uring requests are handled on the same core >> (sync requests) or on core+1 (large async requests) and performance >> improvements are achieved without spinning. >> >> Splice/zero-copy is not supported yet, Ming Lei is working >> on io-uring support for ublk_drv, but I think so far there >> is no final agreement on the approach to be taken yet. >> Fuse-over-io-uring runs significantly faster than reads/writes >> over /dev/fuse, even with splice enabled, so missing zc >> should not be a blocking issue. >> >> The patches have been tested with multiple xfstest runs in a VM >> (32 cores) with a kernel that has several debug options >> enabled (like KASAN and MSAN). >> For some tests xfstests reports that O_DIRECT is not supported, >> I need to investigate that. Interesting part is that exactly >> these tests fail in plain /dev/fuse posix mode. I had to disabled >> generic/650, which is enabling/disabling cpu cores - given ring >> threads are bound to cores issues with that are no totally >> unexpected, but then there (scheduler) kernel messages that >> core binding for these threads is removed - this needs >> to be further investigates. >> Nice effect in io-uring mode is that tests run faster (like >> generic/522 ~2400s /dev/fuse vs. ~1600s patched), though still >> slow as this is with ASAN/leak-detection/etc. >> >> The corresponding libfuse patches are on my uring branch, >> but need cleanup for submission - will happen during the next >> days. >> https://github.com/bsbernd/libfuse/tree/uring >> >> If it should make review easier, patches posted here are on >> this branch >> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.9-rfc2 >> >> TODO list for next RFC versions >> - Let the ring configure ioctl return information, like mmap/queue-buf size >> - Request kernel side address and len for a request - avoid calculation in userspace? 
>> - multiple IO sizes per queue (avoiding a calculation in userspace is probably even >> more important) >> - FUSE_INTERRUPT handling? >> - Logging (adds fields in the ioctl and also ring-request), >> any mismatch between client and server is currently very hard to understand >> through error codes >> >> Future work >> - notifications, probably on their own ring >> - zero copy >> >> I had run quite some benchmarks with linux-6.2 before LSFMMBPF2023, >> which, resulted in some tuning patches (at the end of the >> patch series). >> >> Some benchmark results >> ====================== >> >> System used for the benchmark is a 32 core (HyperThreading enabled) >> Xeon E5-2650 system. I don't have local disks attached that could do >>> 5GB/s IOs, for paged and dio results a patched version of passthrough-hp >> was used that bypasses final reads/writes. >> >> paged reads >> ----------- >> 128K IO size 1024K IO size >> jobs /dev/fuse uring gain /dev/fuse uring gain >> 1 1117 1921 1.72 1902 1942 1.02 >> 2 2502 3527 1.41 3066 3260 1.06 >> 4 5052 6125 1.21 5994 6097 1.02 >> 8 6273 10855 1.73 7101 10491 1.48 >> 16 6373 11320 1.78 7660 11419 1.49 >> 24 6111 9015 1.48 7600 9029 1.19 >> 32 5725 7968 1.39 6986 7961 1.14 >> >> dio reads (1024K) >> ----------------- >> >> jobs /dev/fuse uring gain >> 1 2023 3998 2.42 >> 2 3375 7950 2.83 >> 4 3823 15022 3.58 >> 8 7796 22591 2.77 >> 16 8520 27864 3.27 >> 24 8361 20617 2.55 >> 32 8717 12971 1.55 >> >> mmap reads (4K) >> --------------- >> (sequential, I probably should have made it random, sequential exposes >> a rather interesting/weird 'optimized' memcpy issue - sequential becomes >> reversed order 4K read) >> https://lore.kernel.org/linux-fsdevel/aae918da-833f-7ec5-ac8a-115d66d80d0e@fastmail.fm/ >> >> jobs /dev/fuse uring gain >> 1 130 323 2.49 >> 2 219 538 2.46 >> 4 503 1040 2.07 >> 8 1472 2039 1.38 >> 16 2191 3518 1.61 >> 24 2453 4561 1.86 >> 32 2178 5628 2.58 >> >> (Results on request, setting MAP_HUGETLB much improves performance >> for both, io-uring mode then has a slight advantage only.) >> >> creates/s >> ---------- >> threads /dev/fuse uring gain >> 1 3944 10121 2.57 >> 2 8580 24524 2.86 >> 4 16628 44426 2.67 >> 8 46746 56716 1.21 >> 16 79740 102966 1.29 >> 20 80284 119502 1.49 >> >> (the gain drop with >=8 cores needs to be investigated) > Hi Amir, > Hi Bernd, > > Those are impressive results! thank you! > > When approaching the FUSE uring feature from marketing POV, > I think that putting the emphasis on metadata operations is the > best approach. I can add in some more results and probably need to redo at least the metadata tests. I have all the results in google docs and in plain text files, just a bit cumbersome maybe also spam to post all of it here. > > Not the dio reads are not important (I know that is part of your use case), > but I imagine there are a lot more people out there waiting for > improvement in metadata operations overhead. I think the DIO use case is declining. My fuse work is now related to the DDN Infina project, which has a DLM - this will all go via cache and notifications (into from/to client/server) I need to start to work on that asap... I'm also not too happy yet about cached writes/reads - need to find time to investigate where the limit is. > > To me it helps to know what the current main pain points are > for people using FUSE filesystems wrt performance. 
> > Although it may not be uptodate, the most comprehensive > study about FUSE performance overhead is this FAST17 paper: > > https://www.usenix.org/system/files/conference/fast17/fast17-vangoor.pdf Yeah, I had seen it. Just checking again, interesting is actually their instrumentation branch https://github.com/sbu-fsl/fuse-kernel-instrumentation This should be very useful upstream, in combination with Josefs fuse tracepoints (btw, thanks for the tracepoint patch Josef! I'm going to look at it and test it tomorrow). > > In this paper, table 3 summarizes the different overheads observed > per workload. According to this table, the workloads that degrade > performance worse on an optimized passthrough fs over SSD are: > - many file creates > - many file deletes > - many small file reads > In all these workloads, it was millions of files over many directories. > The highest performance regression reported was -83% on many > small file creations. > > The moral of this long story is that it would be nice to know > what performance improvement FUSE uring can aspire to. > This is especially relevant for people that would be interested > in combining the benefits of FUSE passthrough (for data) and > FUSE uring (for metadata). As written above, I can add a few more data. But if possible I wouldn't like to concentrate on benchmarking - this can be super time consuming and doesn't help unless one investigates what is actually limiting performance. Right now we see that io-uring helps, fixing the other limits is then the next step, imho. > > What did passthrough_hp do in your patched version with creates? > Did it actually create the files? Yeah, it creates files, I think on xfs (or ext4). I had tried tmpfs first, but it had issues with seekdir/telldir until recently - will switch back to tmpfs for next tests. > In how many directories? > Maybe the directory inode lock impeded performance improvement > with >=8 threads? I don't think the directory inode lock is an issue - this should be one (or more directories) per thread Basically /usr/lib64/openmpi/bin/mpirun \ --mca btl self -n $i --oversubscribe \ ./mdtest -F -n40000 -i1 \ -d /scratch/dest -u -b2 | tee ${fname}-$i.out (mdtest is really convenient for meta operations, although requires mpi, recent versions are here (the initial LLNL project merged with ior). https://github.com/hpc/ior "-F" Perform test on files only (no directories). "-n" number_of_items Every process will creat/stat/remove # directories and files "-i" iterations The number of iterations the test will run "-u" Create a unique working directory for each task "-b" branching_factor The branching factor of the hierarchical directory structure [default: 1]. (The older LLNL repo has a better mdtest README https://github.com/LLNL/mdtest) Also, regarding metadata, I definitely need to find time resume work on atomic-open. Besides performance, there is another use case https://github.com/libfuse/libfuse/issues/945. Sweet Tea Dorminy / Josef also seem to need that. 
> >> >> Remaining TODO list for RFCv3: >> -------------------------------- >> 1) Let the ring configure ioctl return information, >> like mmap/queue-buf size >> >> Right now libfuse and kernel have lots of duplicated setup code >> and any kind of pointer/offset mismatch results in a non-working >> ring that is hard to debug - probably better when the kernel does >> the calculations and returns that to server side >> >> 2) In combination with 1, ring requests should retrieve their >> userspace address and length from kernel side instead of >> calculating it through the mmaped queue buffer on their own. >> (Introduction of FUSE_URING_BUF_ADDR_FETCH) >> >> 3) Add log buffer into the ioctl and ring-request >> >> This is to provide better error messages (instead of just >> errno) >> >> 3) Multiple IO sizes per queue >> >> Small IOs and metadata requests do not need large buffer sizes, >> we need multiple IO sizes per queue. >> >> 4) FUSE_INTERRUPT handling >> >> These are not handled yet, kernel side is probably not difficult >> anymore as ring entries take fuse requests through lists. >> >> Long term TODO: >> -------------- >> Notifications through io-uring, maybe with a separated ring, >> but I'm not sure yet. > > Is that going to improve performance in any real life workload? > I'm rather sure that we at DDN will need it for our project with the DLM. I have other priorities for now - once it comes up, adding notifications over uring shouldn't be difficult. Thanks, Bernd
On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: > From: Bernd Schubert <bschubert@ddn.com> > > This adds support for uring communication between kernel and > userspace daemon using opcode the IORING_OP_URING_CMD. The basic > appraoch was taken from ublk. The patches are in RFC state, > some major changes are still to be expected. > > Motivation for these patches is all to increase fuse performance. > In fuse-over-io-uring requests avoid core switching (application > on core X, processing of fuse server on random core Y) and use > shared memory between kernel and userspace to transfer data. > Similar approaches have been taken by ZUFS and FUSE2, though > not over io-uring, but through ioctl IOs What specifically is it about io-uring that's helpful here? Besides the ringbuffer? So the original mess was that because we didn't have a generic ringbuffer, we had aio, tracing, and god knows what else all implementing their own special purpose ringbuffers (all with weird quirks of debatable or no usefulness). It seems to me that what fuse (and a lot of other things) want is just a clean, simple, easy to use generic ringbuffer for sending what-have-you back and forth between the kernel and userspace - in this case RPCs from the kernel to userspace. But instead, the solution seems to be to just toss everything into a new giant subsystem?
On 5/30/24 17:36, Kent Overstreet wrote: > On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: >> From: Bernd Schubert <bschubert@ddn.com> >> >> This adds support for uring communication between kernel and >> userspace daemon using opcode the IORING_OP_URING_CMD. The basic >> appraoch was taken from ublk. The patches are in RFC state, >> some major changes are still to be expected. >> >> Motivation for these patches is all to increase fuse performance. >> In fuse-over-io-uring requests avoid core switching (application >> on core X, processing of fuse server on random core Y) and use >> shared memory between kernel and userspace to transfer data. >> Similar approaches have been taken by ZUFS and FUSE2, though >> not over io-uring, but through ioctl IOs > > What specifically is it about io-uring that's helpful here? Besides the > ringbuffer? > > So the original mess was that because we didn't have a generic > ringbuffer, we had aio, tracing, and god knows what else all > implementing their own special purpose ringbuffers (all with weird > quirks of debatable or no usefulness). > > It seems to me that what fuse (and a lot of other things want) is just a > clean simple easy to use generic ringbuffer for sending what-have-you > back and forth between the kernel and userspace - in this case RPCs from > the kernel to userspace. > > But instead, the solution seems to be just toss everything into a new > giant subsystem? Hmm, initially I had thought about writing my own ring buffer, but then io-uring got IORING_OP_URING_CMD, which seems to have exactly what we need? From an interface point of view, io-uring seems easy to use here, has everything we need and kind of the same thing is used for ublk - what speaks against io-uring? And what other suggestion do you have? I guess the same concern would also apply to ublk_drv. Well, decoupling from io-uring might help to get zero-copy, as there doesn't seem to be an agreement with Ming's approaches (sorry, I'm only silently following for now). From our side, a customer has pointed out security concerns for io-uring. My thinking so far was to implement the required io-uring pieces in a module and access it with ioctls... which would also allow backporting it to RHEL8/RHEL9. Thanks, Bernd
On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote: > Hmm, initially I had thought about writing my own ring buffer, but then > io-uring got IORING_OP_URING_CMD, which seems to have exactly what we > need? From interface point of view, io-uring seems easy to use here, > has everything we need and kind of the same thing is used for ublk - > what speaks against io-uring? And what other suggestion do you have? > > I guess the same concern would also apply to ublk_drv. > > Well, decoupling from io-uring might help to get for zero-copy, as there > doesn't seem to be an agreement with Mings approaches (sorry I'm only > silently following for now). > > From our side, a customer has pointed out security concerns for io-uring. > My thinking so far was to implemented the required io-uring pieces into > an module and access it with ioctls... Which would also allow to > backport it to RHEL8/RHEL9. Well, I've been starting to sketch out a ringbuffer() syscall, which would work on any (supported) file descriptor and give you a ringbuffer for reading or writing (or call it twice for both). That seems to be what fuse really wants, no? You're already using a file descriptor and your own RPC format, you just want a faster communications channel.
On 5/30/24 18:10, Kent Overstreet wrote: > On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote: >> Hmm, initially I had thought about writing my own ring buffer, but then >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we >> need? From interface point of view, io-uring seems easy to use here, >> has everything we need and kind of the same thing is used for ublk - >> what speaks against io-uring? And what other suggestion do you have? >> >> I guess the same concern would also apply to ublk_drv. >> >> Well, decoupling from io-uring might help to get for zero-copy, as there >> doesn't seem to be an agreement with Mings approaches (sorry I'm only >> silently following for now). >> >> From our side, a customer has pointed out security concerns for io-uring. >> My thinking so far was to implemented the required io-uring pieces into >> an module and access it with ioctls... Which would also allow to >> backport it to RHEL8/RHEL9. > > Well, I've been starting to sketch out a ringbuffer() syscall, which > would work on any (supported) file descriptor and give you a ringbuffer > for reading or writing (or call it twice for both). > > That seems to be what fuse really wants, no? You're already using a file > descriptor and your own RPC format, you just want a faster > communications channel. Fine with me, if you have something better/simpler with less security concerns - why not. We just need a community agreement on that. Do you have something I could look at? Thanks, Bernd
On 5/30/24 10:02 AM, Bernd Schubert wrote: > > > On 5/30/24 17:36, Kent Overstreet wrote: >> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: >>> From: Bernd Schubert <bschubert@ddn.com> >>> >>> This adds support for uring communication between kernel and >>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic >>> appraoch was taken from ublk. The patches are in RFC state, >>> some major changes are still to be expected. >>> >>> Motivation for these patches is all to increase fuse performance. >>> In fuse-over-io-uring requests avoid core switching (application >>> on core X, processing of fuse server on random core Y) and use >>> shared memory between kernel and userspace to transfer data. >>> Similar approaches have been taken by ZUFS and FUSE2, though >>> not over io-uring, but through ioctl IOs >> >> What specifically is it about io-uring that's helpful here? Besides the >> ringbuffer? >> >> So the original mess was that because we didn't have a generic >> ringbuffer, we had aio, tracing, and god knows what else all >> implementing their own special purpose ringbuffers (all with weird >> quirks of debatable or no usefulness). >> >> It seems to me that what fuse (and a lot of other things want) is just a >> clean simple easy to use generic ringbuffer for sending what-have-you >> back and forth between the kernel and userspace - in this case RPCs from >> the kernel to userspace. >> >> But instead, the solution seems to be just toss everything into a new >> giant subsystem? > > > Hmm, initially I had thought about writing my own ring buffer, but then > io-uring got IORING_OP_URING_CMD, which seems to have exactly what we > need? From interface point of view, io-uring seems easy to use here, > has everything we need and kind of the same thing is used for ublk - > what speaks against io-uring? And what other suggestion do you have? > > I guess the same concern would also apply to ublk_drv. > > Well, decoupling from io-uring might help to get for zero-copy, as there > doesn't seem to be an agreement with Mings approaches (sorry I'm only > silently following for now). If you have an interest in the zero copy, do chime in, it would certainly help get some closure on that feature. I don't think anyone disagrees it's a useful and needed feature, but there are different view points on how it's best solved. > From our side, a customer has pointed out security concerns for io-uring. That's just bs and fud these days.
On 5/30/24 18:21, Jens Axboe wrote: > On 5/30/24 10:02 AM, Bernd Schubert wrote: >> >> >> On 5/30/24 17:36, Kent Overstreet wrote: >>> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: >>>> From: Bernd Schubert <bschubert@ddn.com> >>>> >>>> This adds support for uring communication between kernel and >>>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic >>>> appraoch was taken from ublk. The patches are in RFC state, >>>> some major changes are still to be expected. >>>> >>>> Motivation for these patches is all to increase fuse performance. >>>> In fuse-over-io-uring requests avoid core switching (application >>>> on core X, processing of fuse server on random core Y) and use >>>> shared memory between kernel and userspace to transfer data. >>>> Similar approaches have been taken by ZUFS and FUSE2, though >>>> not over io-uring, but through ioctl IOs >>> >>> What specifically is it about io-uring that's helpful here? Besides the >>> ringbuffer? >>> >>> So the original mess was that because we didn't have a generic >>> ringbuffer, we had aio, tracing, and god knows what else all >>> implementing their own special purpose ringbuffers (all with weird >>> quirks of debatable or no usefulness). >>> >>> It seems to me that what fuse (and a lot of other things want) is just a >>> clean simple easy to use generic ringbuffer for sending what-have-you >>> back and forth between the kernel and userspace - in this case RPCs from >>> the kernel to userspace. >>> >>> But instead, the solution seems to be just toss everything into a new >>> giant subsystem? >> >> >> Hmm, initially I had thought about writing my own ring buffer, but then >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we >> need? From interface point of view, io-uring seems easy to use here, >> has everything we need and kind of the same thing is used for ublk - >> what speaks against io-uring? And what other suggestion do you have? >> >> I guess the same concern would also apply to ublk_drv. >> >> Well, decoupling from io-uring might help to get for zero-copy, as there >> doesn't seem to be an agreement with Mings approaches (sorry I'm only >> silently following for now). > > If you have an interest in the zero copy, do chime in, it would > certainly help get some closure on that feature. I don't think anyone > disagrees it's a useful and needed feature, but there are different view > points on how it's best solved. We had a bit of discussion with Ming about that last year; besides that, I got busy with other parts, and it became of a bit less personal interest for me as our project really needs to access the buffer (additional checksums, sending it out over a network library (libfabric), possibly even preprocessing of some data) - I think it makes sense if I work on the other fuse parts first and only come back to zero copy a bit later. > >> From our side, a customer has pointed out security concerns for io-uring. > > That's just bs and fud these days. I wasn't in contact with that customer personally, I had just seen their email. It would probably help if RHEL would eventually gain io-uring support - almost all HPC systems are using it or a clone. I was always hoping that RHEL would get it before I'm done with fuse-over-io-uring; now I'm not so sure anymore. Thanks, Bernd
On Thu, May 30, 2024 at 10:21:19AM -0600, Jens Axboe wrote: > On 5/30/24 10:02 AM, Bernd Schubert wrote: > > From our side, a customer has pointed out security concerns for io-uring. > > That's just bs and fud these days. You have a history of being less than responsive with bug reports, and this sort of attitude is not the attitude of a responsible maintainer. From what I've seen those concerns were well founded, so if you want to be taken seriously I'd be talking about what was done to address them instead of name-calling.
On 5/30/24 10:32 AM, Bernd Schubert wrote: > > > On 5/30/24 18:21, Jens Axboe wrote: >> On 5/30/24 10:02 AM, Bernd Schubert wrote: >>> >>> >>> On 5/30/24 17:36, Kent Overstreet wrote: >>>> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: >>>>> From: Bernd Schubert <bschubert@ddn.com> >>>>> >>>>> This adds support for uring communication between kernel and >>>>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic >>>>> appraoch was taken from ublk. The patches are in RFC state, >>>>> some major changes are still to be expected. >>>>> >>>>> Motivation for these patches is all to increase fuse performance. >>>>> In fuse-over-io-uring requests avoid core switching (application >>>>> on core X, processing of fuse server on random core Y) and use >>>>> shared memory between kernel and userspace to transfer data. >>>>> Similar approaches have been taken by ZUFS and FUSE2, though >>>>> not over io-uring, but through ioctl IOs >>>> >>>> What specifically is it about io-uring that's helpful here? Besides the >>>> ringbuffer? >>>> >>>> So the original mess was that because we didn't have a generic >>>> ringbuffer, we had aio, tracing, and god knows what else all >>>> implementing their own special purpose ringbuffers (all with weird >>>> quirks of debatable or no usefulness). >>>> >>>> It seems to me that what fuse (and a lot of other things want) is just a >>>> clean simple easy to use generic ringbuffer for sending what-have-you >>>> back and forth between the kernel and userspace - in this case RPCs from >>>> the kernel to userspace. >>>> >>>> But instead, the solution seems to be just toss everything into a new >>>> giant subsystem? >>> >>> >>> Hmm, initially I had thought about writing my own ring buffer, but then >>> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we >>> need? From interface point of view, io-uring seems easy to use here, >>> has everything we need and kind of the same thing is used for ublk - >>> what speaks against io-uring? And what other suggestion do you have? >>> >>> I guess the same concern would also apply to ublk_drv. >>> >>> Well, decoupling from io-uring might help to get for zero-copy, as there >>> doesn't seem to be an agreement with Mings approaches (sorry I'm only >>> silently following for now). >> >> If you have an interest in the zero copy, do chime in, it would >> certainly help get some closure on that feature. I don't think anyone >> disagrees it's a useful and needed feature, but there are different view >> points on how it's best solved. > > We had a bit of discussion with Ming about that last year, besides that > I got busy with other parts, it got a bit less of personal interest for > me as our project really needs to access the buffer (additional > checksums, sending it out over network library (libfabric), possibly > even preprocessing of some data) - I think it makes sense if I work on > the other fuse parts first and only come back zero copy a bit later. Ah I see - yes if you're going to be touching the data anyway, zero copy is less of a concern. Some memory bandwidth can still be saved if you're not touching all of it, of course. But if you are, you're probably better off copying it in the first place. >>> From our side, a customer has pointed out security concerns for io-uring. >> >> That's just bs and fud these days. 
> > I wasn't in contact with that customer personally, I had just seen their > email.It would probably help if RHEL would eventually gain io-uring > support - almost all of HPC systems are using it or a clone. I was > always hoping that RHEL would get it before I'm done with > fuse-over-io-uring, now I'm not so sure anymore. Not sure what the RHEL status is. I know backports are done on the io_uring side, but not sure what base they are currently on. I strongly suspect that would be a gating factor for getting it enabled. If it's too out of date, then performance isn't going to be as good as current mainline anyway.
On 5/30/24 11:16 AM, Kent Overstreet wrote: > On Thu, May 30, 2024 at 10:21:19AM -0600, Jens Axboe wrote: >> On 5/30/24 10:02 AM, Bernd Schubert wrote: >>> From our side, a customer has pointed out security concerns for io-uring. >> >> That's just bs and fud these days. > > You have a history of being less than responsive with bug reports, and > this sort of attitude is not the attitude of a responsible maintainer. Ok... That's a bold claim. We actually tend to bug reports quickly and get them resolved in a timely manner. Maybe I've been less responsive on a bug report from you, but that's usually because the emails turn out like this one, with odd and unwarranted claims. Not taking the bait. If you're referring to the file reference and umount issue, yes I do very much want to get that one resolved. I do have patches for that, but was never quite happy with them. As it isn't a stability or safety concern, and not a practical concern outside of the test case in question, it hasn't been super high on the radar unfortunately. > From what I've seen those concerns were well founded, so if you want to > be taking seriously I'd be talking about what was done to address them > instead of namecalling. I have addressed it several times in the past. tldr is that yeah the initial history of io_uring wasn't great, due to some unfortunate initial design choices (mostly around async worker setup and identities). Those have since been rectified, and the code base is stable and solid these days.
On Thu, May 30, 2024 at 06:17:29PM +0200, Bernd Schubert wrote: > > > On 5/30/24 18:10, Kent Overstreet wrote: > > On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote: > >> Hmm, initially I had thought about writing my own ring buffer, but then > >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we > >> need? From interface point of view, io-uring seems easy to use here, > >> has everything we need and kind of the same thing is used for ublk - > >> what speaks against io-uring? And what other suggestion do you have? > >> > >> I guess the same concern would also apply to ublk_drv. > >> > >> Well, decoupling from io-uring might help to get for zero-copy, as there > >> doesn't seem to be an agreement with Mings approaches (sorry I'm only > >> silently following for now). > >> > >> From our side, a customer has pointed out security concerns for io-uring. > >> My thinking so far was to implemented the required io-uring pieces into > >> an module and access it with ioctls... Which would also allow to > >> backport it to RHEL8/RHEL9. > > > > Well, I've been starting to sketch out a ringbuffer() syscall, which > > would work on any (supported) file descriptor and give you a ringbuffer > > for reading or writing (or call it twice for both). > > > > That seems to be what fuse really wants, no? You're already using a file > > descriptor and your own RPC format, you just want a faster > > communications channel. > > Fine with me, if you have something better/simpler with less security > concerns - why not. We just need a community agreement on that. > > Do you have something I could look at? Like I said it's at the early sketch stage, I haven't written any code yet. But I'm envisioning something very simple - just a syscall that gives you a mapped buffer of a specified size with head and tail pointers. But this has been kicking around for awhile, so if you're interested I could probably have something for you to try out in the next few days.
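To make the idea concrete, a purely illustrative sketch of such a mapped ring - a shared header with head/tail indices in front of a power-of-two data area, with the consumer popping fixed-size records - could look like the following. None of this corresponds to an existing kernel interface; struct ringbuffer_desc, the record handling and the proposed syscall itself are all hypothetical.

/*
 * Illustrative layout for a mapped head/tail ring.  Assumes the ring
 * size is a power of two and a multiple of the record size, so a
 * record never wraps around the end of the data area.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

struct ringbuffer_desc {
	_Atomic uint32_t head;		/* advanced by the consumer */
	_Atomic uint32_t tail;		/* advanced by the producer */
	uint32_t	 mask;		/* ring size - 1 */
	uint32_t	 pad;
	uint8_t		 data[];	/* record bytes follow */
};

/* Consumer side: copy out one fixed-size record if one is available. */
static int ring_read(struct ringbuffer_desc *rb, void *rec, uint32_t rec_size)
{
	uint32_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);

	if (tail - head < rec_size)
		return -1;		/* ring is empty */

	memcpy(rec, &rb->data[head & rb->mask], rec_size);
	atomic_store_explicit(&rb->head, head + rec_size,
			      memory_order_release);
	return 0;
}

Whether such a ring would carry raw bytes, fixed-size records or serialized requests is exactly the design question being debated in this thread.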
On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote: > I have addressed it several times in the past. tldr is that yeah the > initial history of io_uring wasn't great, due to some unfortunate > initial design choices (mostly around async worker setup and > identities). Not to pick on you too much but the initial history looked pretty messy to me - a lot of layering violations - it made aio.c look clean. I know you were in "get shit done" mode, but at some point we have to take a step back and ask "what are the different core concepts being expressed here, and can we start picking them apart?". A generic ringbuffer would be a good place to start. I'd also really like to see some more standardized mechanisms for "I'm a kernel thread doing work on behalf of some other user thread" - this comes up elsewhere, I'm talking with David Howells right now about fsconfig which is another place it is or will be coming up. > Those have since been rectified, and the code base is > stable and solid these days. good tests, code coverage analysis to verify, good syzbot coverage?
On 5/30/24 11:58 AM, Kent Overstreet wrote: > On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote: >> I have addressed it several times in the past. tldr is that yeah the >> initial history of io_uring wasn't great, due to some unfortunate >> initial design choices (mostly around async worker setup and >> identities). > > Not to pick on you too much but the initial history looked pretty messy > to me - a lot of layering violations - it made aio.c look clean. Oh I certainly agree, the initial code was in a much worse state than it is in now. Lots of things have happened there, like splitting things up and adding appropriate layering. That was more of a code hygiene kind of thing, to make it easier to understand, maintain, and develop. Any new subsystem is going to see lots of initial churn, regardless of how long it's been developed before going into upstream. We certainly had lots of churn, where these days it's stabilized. I don't think that's unusual, particularly for something that attempts to do certain things very differently. I would've loved to start with our current state, but I don't think years of being out of tree would've completely solved that. Some things you just don't find until it's in tree, unfortunately. > I know you were in "get shit done" mode, but at some point we have to > take a step back and ask "what are the different core concepts being > expressed here, and can we start picking them apart?". A generic > ringbuffer would be a good place to start. > > I'd also really like to see some more standardized mechanisms for "I'm a > kernel thread doing work on behalf of some other user thread" - this > comes up elsewhere, I'm talking with David Howells right now about > fsconfig which is another place it is or will be coming up. That does exist, and it came from the io_uring side of needing exactly that. This is why we have create_io_thread(). IMHO it's the only sane way to do it, trying to guesstimate what happens deep down in a random callstack, and setting things up appropriately, is impossible. This is where most of the earlier day io_uring issues came from, and what I referred to as a "poor initial design choice". >> Those have since been rectified, and the code base is >> stable and solid these days. > > good tests, code coverage analysis to verify, good syzbot coverage? 3x yes. Obviously I'm always going to say that tests could be better, have better coverage, cover more things, because nothing is perfect (and if you think it is, you're fooling yourself) and as a maintainer I want perfect coverage. But we're pretty diligent these days about adding tests for everything. And any regression or bug report always gets test cases written.
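For reference, the create_io_thread() pattern mentioned above looks roughly like the sketch below; the real user is io_uring's io-wq worker pool (io_uring/io-wq.c). Because the worker is cloned directly from the task that spawns it, it inherits that task's mm, files and creds rather than guessing at context later. my_worker_fn, struct my_worker_data and the loop body are simplified placeholders; create_io_thread() and wake_up_new_task() are the actual kernel helpers.

/*
 * Rough sketch of spawning an io-worker style thread from the
 * submitting task.  Error handling and work-item plumbing trimmed.
 */
#include <linux/sched.h>
#include <linux/sched/task.h>
#include <linux/err.h>

struct my_worker_data {
	bool exit;			/* placeholder shutdown flag */
};

static int my_worker_fn(void *data)
{
	struct my_worker_data *wd = data;

	/*
	 * Runs as a PF_IO_WORKER thread cloned from the creating task,
	 * so mm, files, creds etc. are the submitter's.
	 */
	while (!READ_ONCE(wd->exit)) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (!READ_ONCE(wd->exit))
			schedule();	/* real code would pull work items here */
		__set_current_state(TASK_RUNNING);
	}
	do_exit(0);
}

static int my_spawn_worker(struct my_worker_data *wd, int numa_node)
{
	struct task_struct *tsk;

	tsk = create_io_thread(my_worker_fn, wd, numa_node);
	if (IS_ERR(tsk))
		return PTR_ERR(tsk);

	wake_up_new_task(tsk);		/* thread starts running my_worker_fn */
	return 0;
}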
On Thu, May 30, 2024 at 06:17:29PM +0200, Bernd Schubert wrote: > > > On 5/30/24 18:10, Kent Overstreet wrote: > > On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote: > >> Hmm, initially I had thought about writing my own ring buffer, but then > >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we > >> need? From interface point of view, io-uring seems easy to use here, > >> has everything we need and kind of the same thing is used for ublk - > >> what speaks against io-uring? And what other suggestion do you have? > >> > >> I guess the same concern would also apply to ublk_drv. > >> > >> Well, decoupling from io-uring might help to get for zero-copy, as there > >> doesn't seem to be an agreement with Mings approaches (sorry I'm only > >> silently following for now). > >> > >> From our side, a customer has pointed out security concerns for io-uring. > >> My thinking so far was to implemented the required io-uring pieces into > >> an module and access it with ioctls... Which would also allow to > >> backport it to RHEL8/RHEL9. > > > > Well, I've been starting to sketch out a ringbuffer() syscall, which > > would work on any (supported) file descriptor and give you a ringbuffer > > for reading or writing (or call it twice for both). > > > > That seems to be what fuse really wants, no? You're already using a file > > descriptor and your own RPC format, you just want a faster > > communications channel. > > Fine with me, if you have something better/simpler with less security > concerns - why not. We just need a community agreement on that. > > Do you have something I could look at? FWIW I have no strong feelings between using iouring vs any other ringbuffer mechanism we come up with in the future. That being said iouring is here now, is proven to work, and these are good performance improvements. If in the future something else comes along that gives us better performance then absolutely we should explore adding that functionality. But this solves the problem today, and I need the problem solved yesterday, so continuing with this patchset is very much a worthwhile investment, one that I'm very happy you're tackling Bernd instead of me ;). Thanks, Josef
On Thu, May 30, 2024 at 12:48:56PM -0600, Jens Axboe wrote: > On 5/30/24 11:58 AM, Kent Overstreet wrote: > > On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote: > >> I have addressed it several times in the past. tldr is that yeah the > >> initial history of io_uring wasn't great, due to some unfortunate > >> initial design choices (mostly around async worker setup and > >> identities). > > > > Not to pick on you too much but the initial history looked pretty messy > > to me - a lot of layering violations - it made aio.c look clean. > > Oh I certainly agree, the initial code was in a much worse state than it > is in now. Lots of things have happened there, like splitting things up > and adding appropriate layering. That was more of a code hygiene kind of > thing, to make it easier to understand, maintain, and develop. > > Any new subsystem is going to see lots of initial churn, regardless of > how long it's been developed before going into upstream. We certainly > had lots of churn, where these days it's stabilized. I don't think > that's unusual, particularly for something that attempts to do certain > things very differently. I would've loved to start with our current > state, but I don't think years of being out of tree would've completely > solved that. Some things you just don't find until it's in tree, > unfortunately. Well, the main thing I would've liked is a bit more discussion in the early days of io_uring; there are things we could've done differently back then that could've got us something cleaner in the long run. My main complaints were always - yet another special purpose ringbuffer, and - yet another parallel syscall interface. We've got too many of those too (aio is another), and the API fragmentation is a real problem for userspace that just wants to be able to issue arbitrary syscalls asynchronously. IO uring could've just been serializing syscall numbers and arguments - that would have worked fine. Given the history of failed AIO replacements just wanting to shove in something working was understandable, but I don't think those would have been big asks. > > I'd also really like to see some more standardized mechanisms for "I'm a > > kernel thread doing work on behalf of some other user thread" - this > > comes up elsewhere, I'm talking with David Howells right now about > > fsconfig which is another place it is or will be coming up. > > That does exist, and it came from the io_uring side of needing exactly > that. This is why we have create_io_thread(). IMHO it's the only sane > way to do it, trying to guesstimate what happens deep down in a random > callstack, and setting things up appropriately, is impossible. This is > where most of the earlier day io_uring issues came from, and what I > referred to as a "poor initial design choice". Thanks, I wasn't aware of that - that's worth highlighting. I may switch thread_with_file to that, and the fsconfig work David and I are talking about can probably use it as well. We really should have something lighter weight that we can use for work items though, that's our standard mechanism for deferred work, not spinning up a whole kthread. We do have kthread_use_mm() - there's no reason we couldn't do an expanded version of that for all the other shared resources that need to be available. 
This was also another blocker in the other aborted AIO replacements, so reusing clone flags does seem like the most reasonable way to make progress here, but I would wager there's other stuff in task_struct that really should be shared and isn't. task_struct is up to 825 (!) lines now, which means good luck on even finding that stuff - maybe at some point we'll have to get a giant effort going to clean up and organize task_struct, like willy's been doing with struct page. > >> Those have since been rectified, and the code base is > >> stable and solid these days. > > > > good tests, code coverage analysis to verify, good syzbot coverage? > > 3x yes. Obviously I'm always going to say that tests could be better, > have better coverage, cover more things, because nothing is perfect (and > if you think it is, you're fooling yourself) and as a maintainer I want > perfect coverage. But we're pretty diligent these days about adding > tests for everything. And any regression or bug report always gets test > cases written. *nod* that's encouraging. Looking forward to the umount issue being fixed so I can re-enable it in my tests...
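As a reference point for the kthread_use_mm() mechanism brought up above, a minimal sketch of a kernel worker temporarily adopting a user task's mm might look like this. kthread_use_mm(), kthread_unuse_mm(), mmget_not_zero() and mmput() are the real helpers and the caller must itself be a kernel thread; where the mm and the user pointer come from, and the function name, are hypothetical.

/*
 * Minimal sketch: borrow a user task's mm so copy_to_user() resolves
 * against that task's address space, then drop it again.
 */
#include <linux/kthread.h>
#include <linux/sched/mm.h>
#include <linux/uaccess.h>
#include <linux/errno.h>

static int copy_result_to_task(struct mm_struct *mm,
			       void __user *dst, const void *src, size_t len)
{
	int ret = 0;

	if (!mmget_not_zero(mm))	/* keep the mm alive while we use it */
		return -ESRCH;

	kthread_use_mm(mm);		/* current now operates on that mm */
	if (copy_to_user(dst, src, len))
		ret = -EFAULT;
	kthread_unuse_mm(mm);

	mmput(mm);
	return ret;
}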
On Thu, May 30, 2024 at 03:09:41PM -0400, Josef Bacik wrote: > On Thu, May 30, 2024 at 06:17:29PM +0200, Bernd Schubert wrote: > > > > > > On 5/30/24 18:10, Kent Overstreet wrote: > > > On Thu, May 30, 2024 at 06:02:21PM +0200, Bernd Schubert wrote: > > >> Hmm, initially I had thought about writing my own ring buffer, but then > > >> io-uring got IORING_OP_URING_CMD, which seems to have exactly what we > > >> need? From interface point of view, io-uring seems easy to use here, > > >> has everything we need and kind of the same thing is used for ublk - > > >> what speaks against io-uring? And what other suggestion do you have? > > >> > > >> I guess the same concern would also apply to ublk_drv. > > >> > > >> Well, decoupling from io-uring might help to get for zero-copy, as there > > >> doesn't seem to be an agreement with Mings approaches (sorry I'm only > > >> silently following for now). > > >> > > >> From our side, a customer has pointed out security concerns for io-uring. > > >> My thinking so far was to implemented the required io-uring pieces into > > >> an module and access it with ioctls... Which would also allow to > > >> backport it to RHEL8/RHEL9. > > > > > > Well, I've been starting to sketch out a ringbuffer() syscall, which > > > would work on any (supported) file descriptor and give you a ringbuffer > > > for reading or writing (or call it twice for both). > > > > > > That seems to be what fuse really wants, no? You're already using a file > > > descriptor and your own RPC format, you just want a faster > > > communications channel. > > > > Fine with me, if you have something better/simpler with less security > > concerns - why not. We just need a community agreement on that. > > > > Do you have something I could look at? > > FWIW I have no strong feelings between using iouring vs any other ringbuffer > mechanism we come up with in the future. > > That being said iouring is here now, is proven to work, and these are good > performance improvements. If in the future something else comes along that > gives us better performance then absolutely we should explore adding that > functionality. But this solves the problem today, and I need the problem solved > yesterday, so continuing with this patchset is very much a worthwhile > investment, one that I'm very happy you're tackling Bernd instead of me ;). > Thanks, I suspect a ringbuffer syscall will actually be simpler than switching to io_uring. Let me see if I can cook something up quickly - there's no rocket science here and this is all stuff we've done before, so it shouldn't take too long (famous last words...)
On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: > From: Bernd Schubert <bschubert@ddn.com> > > This adds support for uring communication between kernel and > userspace daemon using opcode the IORING_OP_URING_CMD. The basic > appraoch was taken from ublk. The patches are in RFC state, > some major changes are still to be expected. > First, thanks for tackling this, this is a pretty big project and pretty important, you've put a lot of work into this and it's pretty great. A few things that I've pointed out elsewhere, but bear repeating and keeping in mind for the entire patch series. 1. Make sure you've got changelogs. There's several patches that just don't have changelogs. I get things where it's like "add a mmap interface", but it would be good to have an explanation as to why you're adding it and what we hope to get out of that change. 2. Again as I stated elsewhere, you add a bunch of structs and stuff that aren't related to the current patch, which makes it difficult for me to work out what's needed or how it's used, so I go back and forth between the code and the patch a lot, and I've probably missed a few things. 3. Drop the CPU scheduling changes for this first pass. The performance optimizations are definitely worth pursuing, but I don't want to get hung up in waiting on the scheduler dependencies to land. Additionally what looks like it works in your setup may not necessarily be good for everybody's setup. Having the baseline stuff in and working well, and then providing patches to change the CPU stuff in a way that we can test in a variety of different environments to validate the wins would be better. As someone who has to regularly go figure out what in the scheduler changed to make all of our metrics look bad again, I'm very wary of changes that make CPU selection policy decisions in a way that aren't changeable without changing the code. Thanks, Josef
On 5/30/24 1:35 PM, Kent Overstreet wrote: > On Thu, May 30, 2024 at 12:48:56PM -0600, Jens Axboe wrote: >> On 5/30/24 11:58 AM, Kent Overstreet wrote: >>> On Thu, May 30, 2024 at 11:28:43AM -0600, Jens Axboe wrote: >>>> I have addressed it several times in the past. tldr is that yeah the >>>> initial history of io_uring wasn't great, due to some unfortunate >>>> initial design choices (mostly around async worker setup and >>>> identities). >>> >>> Not to pick on you too much but the initial history looked pretty messy >>> to me - a lot of layering violations - it made aio.c look clean. >> >> Oh I certainly agree, the initial code was in a much worse state than it >> is in now. Lots of things have happened there, like splitting things up >> and adding appropriate layering. That was more of a code hygiene kind of >> thing, to make it easier to understand, maintain, and develop. >> >> Any new subsystem is going to see lots of initial churn, regardless of >> how long it's been developed before going into upstream. We certainly >> had lots of churn, where these days it's stabilized. I don't think >> that's unusual, particularly for something that attempts to do certain >> things very differently. I would've loved to start with our current >> state, but I don't think years of being out of tree would've completely >> solved that. Some things you just don't find until it's in tree, >> unfortunately. > > Well, the main thing I would've liked is a bit more discussion in the > early days of io_uring; there are things we could've done differently > back then that could've got us something cleaner in the long run. No matter how much discussion would've been had, there always would've been compromises or realizations that "yeah that thing there should've been done differently". In any case, pointless to argue about that, as the only thing we can change is how things look in the present and future. At least I don't have a time machine. > My main complaints were always > - yet another special purpose ringbuffer, and > - yet another parallel syscall interface. Exactly how many "parallel syscall interfaces" do we have? > We've got too many of those too (aio is another), and the API Like which ones? aio is a special case async interface for O_DIRECT IO, that's about it. It's not a generic IO interface. It's literally dio only. And yes, then it has the option of syncing a file, and poll got added some years ago as well. But for the longest duration of aio, it was just dio aio. The early versions of io_uring actually added on top of that, but I didn't feel like it belonged there. > fragmentation is a real problem for userspace that just wants to be able > to issue arbitrary syscalls asynchronously. IO uring could've just been > serializing syscall numbers and arguments - that would have worked fine. That doesn't work at all. If all syscalls had been designed with issue + wait semantics, then yeah that would obviously be the way that it would've been done. You just add all of them, and pass arguments, done. But that's not reality. You can do that if you just offload to a worker thread, but that's not how you get performance. And you could very much STILL do just that, in fact it'd be trivial to wire up. But nobody would use it, because something that just always punts to a thread would be awful for performance. You may as well just do that offload in userspace then. Hence the majority of the work for wiring up operations that implement existing functionality in an async way is core work. 
The io_uring interface for it is the simplest part, once you have the underpinnings doing what you want. Sometimes that's some ugly "this can't block, if it does, return -EAGAIN", and sometimes it's refactoring things a bit so that you can tap into the inevitable waitqueue. There's no single recipe, it all depends on how it currently works. > Given the history of failed AIO replacements just wanting to shove in > something working was understandable, but I don't think those would have > been big asks. What are these failed AIO replacements? aio is for storage, io_uring was never meant to be a storage only interface. The only other attempt I can recall, back in the day, was the acall and threadlet stuff that Ingo and zab worked on. And even that attempted to support async in a performant way, by doing work inline whenever possible. But hard to use, as you'd return as a different pid if the original task blocked. >>> I'd also really like to see some more standardized mechanisms for "I'm a >>> kernel thread doing work on behalf of some other user thread" - this >>> comes up elsewhere, I'm talking with David Howells right now about >>> fsconfig which is another place it is or will be coming up. >> >> That does exist, and it came from the io_uring side of needing exactly >> that. This is why we have create_io_thread(). IMHO it's the only sane >> way to do it, trying to guesstimate what happens deep down in a random >> callstack, and setting things up appropriately, is impossible. This is >> where most of the earlier day io_uring issues came from, and what I >> referred to as a "poor initial design choice". > > Thanks, I wasn't aware of that - that's worth highlighting. I may switch > thread_with_file to that, and the fsconfig work David and I are talking > about can probably use it as well. > > We really should have something lighter weight that we can use for work > items though, that's our standard mechanism for deferred work, not > spinning up a whole kthread. We do have kthread_use_mm() - there's no > reason we couldn't do an expanded version of that for all the other > shared resources that need to be available. Like io-wq does for io_uring? That's why it's there. io_uring tries not to rely on it very much, it's considered the slow path for the above mentioned reasons of why thread offload generally isn't a great idea. But at least it doesn't require a full fork for each item. > This was also another blocker in the other aborted AIO replacements, so > reusing clone flags does seem like the most reasonable way to make > progress here, but I would wager there's other stuff in task_struct that > really should be shared and isn't. task_struct is up to 825 (!) lines > now, which means good luck on even finding that stuff - maybe at some > point we'll have to get a giant effort going to clean up and organize > task_struct, like willy's been doing with struct page. Well that thing is an unwieldy beast and has been for many years. So yeah, very much agree that it needs some tender love and care, and we'd be better off for it. >>>> Those have since been rectified, and the code base is >>>> stable and solid these days. >>> >>> good tests, code coverage analysis to verify, good syzbot coverage? >> >> 3x yes. Obviously I'm always going to say that tests could be better, >> have better coverage, cover more things, because nothing is perfect (and >> if you think it is, you're fooling yourself) and as a maintainer I want >> perfect coverage. 
But we're pretty diligent these days about adding >> tests for everything. And any regression or bug report always gets test >> cases written. > > *nod* that's encouraging. Looking forward to the umount issue being > fixed so I can re-enable it in my tests... I'll pick it up again soon enough, I'll let you know.
On Fri, May 31, 2024 at 12:21 AM Jens Axboe <axboe@kernel.dk> wrote: > > On 5/30/24 10:02 AM, Bernd Schubert wrote: > > > > > > On 5/30/24 17:36, Kent Overstreet wrote: > >> On Wed, May 29, 2024 at 08:00:35PM +0200, Bernd Schubert wrote: > >>> From: Bernd Schubert <bschubert@ddn.com> > >>> > >>> This adds support for uring communication between kernel and > >>> userspace daemon using opcode the IORING_OP_URING_CMD. The basic > >>> appraoch was taken from ublk. The patches are in RFC state, > >>> some major changes are still to be expected. > >>> > >>> Motivation for these patches is all to increase fuse performance. > >>> In fuse-over-io-uring requests avoid core switching (application > >>> on core X, processing of fuse server on random core Y) and use > >>> shared memory between kernel and userspace to transfer data. > >>> Similar approaches have been taken by ZUFS and FUSE2, though > >>> not over io-uring, but through ioctl IOs > >> > >> What specifically is it about io-uring that's helpful here? Besides the > >> ringbuffer? > >> > >> So the original mess was that because we didn't have a generic > >> ringbuffer, we had aio, tracing, and god knows what else all > >> implementing their own special purpose ringbuffers (all with weird > >> quirks of debatable or no usefulness). > >> > >> It seems to me that what fuse (and a lot of other things want) is just a > >> clean simple easy to use generic ringbuffer for sending what-have-you > >> back and forth between the kernel and userspace - in this case RPCs from > >> the kernel to userspace. > >> > >> But instead, the solution seems to be just toss everything into a new > >> giant subsystem? > > > > > > Hmm, initially I had thought about writing my own ring buffer, but then > > io-uring got IORING_OP_URING_CMD, which seems to have exactly what we > > need? From interface point of view, io-uring seems easy to use here, > > has everything we need and kind of the same thing is used for ublk - > > what speaks against io-uring? And what other suggestion do you have? > > > > I guess the same concern would also apply to ublk_drv. > > > > Well, decoupling from io-uring might help to get for zero-copy, as there > > doesn't seem to be an agreement with Mings approaches (sorry I'm only > > silently following for now). We have concluded pipe & splice isn't good for zero copy, and io_uring provides zc in async way, which is really nice for async application. > > If you have an interest in the zero copy, do chime in, it would > certainly help get some closure on that feature. I don't think anyone > disagrees it's a useful and needed feature, but there are different view > points on how it's best solved. Now generic sqe group feature is being added, and generic zero copy can be built over it easily, can you or anyone take a look? https://lore.kernel.org/linux-block/20240511001214.173711-1-ming.lei@redhat.com/ Thanks, Ming
On Wed, 29 May 2024 at 20:01, Bernd Schubert <bschubert@ddn.com> wrote: > > From: Bernd Schubert <bschubert@ddn.com> > > This adds support for uring communication between kernel and > userspace daemon using opcode the IORING_OP_URING_CMD. The basic > appraoch was taken from ublk. The patches are in RFC state, > some major changes are still to be expected. Thank you very much for tackling this. I think this is an important feature and one that could potentially have a significant effect on fuse performance, which is something many people would love to see. I'm thinking about the architecture and there are some questions: Have you tried just plain IORING_OP_READV / IORING_OP_WRITEV? That's would just be the async part, without the mapped buffer. I suspect most of the performance advantage comes from this and the per-CPU queue, not from the mapped buffer, yet most of the complexity seems to be related to the mapped buffer. Maybe there's an advantage in using an atomic op for WRITEV + READV, but I'm not quite seeing it yet, since there's no syscall overhead for separate ops. What's the reason for separate async and sync request queues? > Avoiding cache line bouncing / numa systems was discussed > between Amir and Miklos before and Miklos had posted > part of the private discussion here > https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/ > > This cache line bouncing should be addressed by these patches > as well. Why do you think this is solved? > I had also noticed waitq wake-up latencies in fuse before > https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/ > > This spinning approach helped with performance (>40% improvement > for file creates), but due to random server side thread/core utilization > spinning cannot be well controlled in /dev/fuse mode. > With fuse-over-io-uring requests are handled on the same core > (sync requests) or on core+1 (large async requests) and performance > improvements are achieved without spinning. I feel this should be a scheduler decision, but the selecting the queue needs to be based on that decision. Maybe the scheduler people can help out with this. Thanks, Miklos
On 6/11/24 10:20, Miklos Szeredi wrote: > On Wed, 29 May 2024 at 20:01, Bernd Schubert <bschubert@ddn.com> wrote: >> >> From: Bernd Schubert <bschubert@ddn.com> >> >> This adds support for uring communication between kernel and >> userspace daemon using opcode the IORING_OP_URING_CMD. The basic >> appraoch was taken from ublk. The patches are in RFC state, >> some major changes are still to be expected. > > Thank you very much for tackling this. I think this is an important > feature and one that could potentially have a significant effect on > fuse performance, which is something many people would love to see. > > I'm thinking about the architecture and there are some questions: > > Have you tried just plain IORING_OP_READV / IORING_OP_WRITEV? That's > would just be the async part, without the mapped buffer. I suspect > most of the performance advantage comes from this and the per-CPU > queue, not from the mapped buffer, yet most of the complexity seems to > be related to the mapped buffer. I didn't try because IORING_OP_URING_CMD seems to be exactly made for our case. Firstly and although I didn't have time to look into it yet, but with the current approach it should be rather simple to switch to another ring as Kent has suggested. Secondly, with IORING_OP_URING_CMD we already have only a single command to submit requests and fetch the next one - half of the system calls. Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach? https://github.com/uroni/fuseuring? I.e. it hook into the existing fuse and just changes from read()/write() of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero control which ring/queue and which ring-entry handles the request. Thirdly, initially I had even removed the allocation of 'struct fuse_req' and directly allocated these on available ring entries. I.e. the application thread was writing to the mmap ring buffer. I just removed that code for now, as it introduced additional complexity with an unknown performance outcome. If it should be helpful we could add that later. I don't think we have that flexibility with IORING_OP_READV/IORING_OP_WRITEV. And then, personally I do not see mmap complexity - it is just very convenient to write/read to/from the ring buffer from any kernel thread. Currently not supported by io-uring, but I think we could even avoid any kind of system call and let the application poll for results. Similar to what IORING_SETUP_SQPOLL does, but without the requirement of another kernel thread. Most complexity and issues I got, come from the requirement of io_uring to complete requests with io_uring_op_cmd_done. In RFCv1 you had annotated the async shutdown method - that was indeed really painful and resulted in a never ending number of shutdown races. Once I removed that and also removed using a bitmap (I don't even know why I used a bitmap in the first place in RFCv1 and not lists as in RFCv2) shutdown became manageable. If there would be way to tell io-uring that kernel fuse is done and to let it complete itself whatever was not completed yet, it would would be great help. Although from my point of view, that could be done once this series is merged. Or we decide to switch to Kents new ring buffer and might not have that problem at all... > > Maybe there's an advantage in using an atomic op for WRITEV + READV, > but I'm not quite seeing it yet, since there's no syscall overhead for > separate ops. Here I get confused, could please explain? 
Current fuse has a read + write operation - a read() system call to process a fuse request and a write() call to submit the result and then read() to fetch the next request. If userspace has to send IORING_OP_READV to fetch a request and complete with IORING_OP_WRITEV, it would go through the existing code path with those operations? Well, maybe userspace could submit with IOSQE_IO_LINK, but that sounds like it would need to send two ring entries? I.e. memory and processing overhead? And then, no way to further optimize and do fuse_req allocation on the ring (if there are free entries). And probably also no way that we ever let the application work in the SQPOLL way, because the application thread does not have the right to read from the fuse-server buffer? I.e. what I mean is that IORING_OP_URING_CMD gives better flexibility. Btw, another issue that is avoided with the new ring-request layout is compatibility and alignment. The fuse header is always in a 4K section and the request data follows after it. I.e. extending request sizes does not impose compatibility issues any more and also for passthrough and similar - O_DIRECT can be passed through to the backend file system. Although these issues probably need to be solved in the current fuse protocol as well. > > What's the reason for separate async and sync request queues? To have credits for IO operations. For an overlay file system it might not matter, because it might get stuck with another system call in the underlying file system. But for something that is not system call bound and that has control, async and sync can be handled with priority given by userspace. As I had written in the introduction mail, I'm currently working on different IO sizes per ring queue - it gets even more fine grained and with the advantage of reduced memory usage per queue when the queue has entries for many small requests and a few large ones. Next step here would be to add credits for reads/writes (or to reserve credits for meta operations) in the sync queue, so that meta operations can always go through. If there should be async meta operations (through application io-uring requests?), the same would need to be done for the async queue as well. Last but not least, with separation there is no global async queue anymore - no global lock and cache issues. > >> Avoiding cache line bouncing / numa systems was discussed >> between Amir and Miklos before and Miklos had posted >> part of the private discussion here >> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/ >> >> This cache line bouncing should be addressed by these patches >> as well. > > Why do you think this is solved? I _guess_ that writing to the mmaped buffer and processing requests on the same cpu core should make it possible to keep things in cpu cache. I did not verify that with perf, though. > >> I had also noticed waitq wake-up latencies in fuse before >> https://lore.kernel.org/lkml/9326bb76-680f-05f6-6f78-df6170afaa2c@fastmail.fm/T/ >> >> This spinning approach helped with performance (>40% improvement >> for file creates), but due to random server side thread/core utilization >> spinning cannot be well controlled in /dev/fuse mode. >> With fuse-over-io-uring requests are handled on the same core >> (sync requests) or on core+1 (large async requests) and performance >> improvements are achieved without spinning. > > I feel this should be a scheduler decision, but the selecting the > queue needs to be based on that decision.
> Maybe the scheduler people > can help out with this. For sync requests, getting the scheduler involved is what is responsible for making fuse really slow. It schedules on random cores that are in sleep states, and additionally frequency scaling does not go up. We really need to stay on the same core here, as that core is submitting the result, is already running (i.e. not sleeping) and has data in its cache. All benchmark results with sync requests point that out. For async requests, the above still applies, but basically one thread is writing/reading and the other thread handles/provides the data. Random switching of cores is then still not good. At best, and to be tested, one could submit rather large chunks to other cores. What is indeed to be discussed (and I think annotated in the corresponding patch description) is whether there is a way to figure out if the other core is already busy. But then the scheduler does not know what it is busy with - are these existing fuse requests or something else. That part is really hard and I don't think it makes sense to discuss this right now before the main part is merged. IMHO, better to add a config flag for the current cpu+1 scheduling with an annotation that this setting might go away in the future. Thanks, Bernd
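[For illustration, a rough C sketch of the per-entry buffer layout described in the mail above: the fuse header sits in its own 4K section and page-aligned payload follows, which is what makes O_DIRECT passthrough and future header extensions unproblematic. All names and sizes are assumptions for illustration, not the layout defined by the RFC patches.]

/* Illustrative sketch of one fuse-over-io-uring queue entry as described
 * above: the fuse header lives in its own 4K section, the payload that
 * follows stays page aligned (e.g. for O_DIRECT passthrough), and the
 * buffer remains valid until the request is completed.  All names and
 * sizes are hypothetical - the RFC patches define their own layout. */
#include <stdint.h>

#define FUSE_RING_HEADER_SECTION 4096	/* assumed fixed header section */

struct fuse_ring_entry_buf {
	union {
		uint8_t header_section[FUSE_RING_HEADER_SECTION];
		struct {
			uint64_t unique;	/* fuse request id */
			uint32_t opcode;	/* FUSE_READ, FUSE_LOOKUP, ... */
			uint32_t payload_len;	/* valid bytes in payload[] */
		} hdr;
	};
	uint8_t payload[];	/* page-aligned request data follows */
};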
On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > Secondly, with IORING_OP_URING_CMD we already have only a single command > to submit requests and fetch the next one - half of the system calls. > > Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach? > https://github.com/uroni/fuseuring? > I.e. it hook into the existing fuse and just changes from read()/write() > of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero > control which ring/queue and which ring-entry handles the request. Unlike system calls, io_uring ops should have very little overhead. That's one of the main selling points of io_uring (as described in the io_uring(7) man page). So I don't think it matters to performance whether there's a combined WRITEV + READV (or COMMIT + FETCH) op or separate ops. The advantage of separate ops is more flexibility and less complexity (do only one thing in an op). > Thirdly, initially I had even removed the allocation of 'struct > fuse_req' and directly allocated these on available ring entries. I.e. > the application thread was writing to the mmap ring buffer. I just > removed that code for now, as it introduced additional complexity with > an unknown performance outcome. If it should be helpful we could add > that later. I don't think we have that flexibility with > IORING_OP_READV/IORING_OP_WRITEV. I think I understand what you'd like to see in the end: basically a reverse io_uring, where requests are placed on a "submission queue" by the kernel and completed requests are placed on a completion queue by the userspace. Exactly the opposite of current io_uring. The major difference between your idea of a fuse_uring and the io_uring seems to be that you place not only the request on the shared buffer, but the data as well. I don't think this is a good idea, since it will often incur one more memory copy. Otherwise the idea itself seems sound. The implementation twisted due to having to integrate it with io_uring. Unfortunately placing fuse requests directly into the io_uring queue doesn't work, due to the reversal of roles and the size difference between sqe and cqe entries. Also the shared buffer seems to lose its ring aspect due to the fact that fuse doesn't get notified when a request is taken off the queue (io_uring_cqe_seen(3)). So I think either better integration with io_uring is needed with support for "reverse submission" or a new interface. > > > > Maybe there's an advantage in using an atomic op for WRITEV + READV, > > but I'm not quite seeing it yet, since there's no syscall overhead for > > separate ops. > > Here I get confused, could please explain? > Current fuse has a read + write operation - a read() system call to > process a fuse request and a write() call to submit the result and then > read() to fetch the next request. > If userspace has to send IORING_OP_READV to fetch a request and complete > with IORING_OP_IORING_OP_WRITEV it would go through existing code path > with operations? Well, maybe userspace could submit with IOSQE_IO_LINK, > but that sounds like it would need to send two ring entries? I.e. memory > and processing overhead? Overhead should be minimal. > And then, no way to further optimize and do fuse_req allocation on the > ring (if there are free entries). And probably also no way that we ever > let the application work in the SQPOLL way, because the application > thread does not have the right to read from the fuse-server buffer? I.e. 
> what I mean is that IORING_OP_URING_CMD gives a better flexibility. There should be no difference between IORING_OP_URING_CMD and IORING_OP_WRITEV + IORING_OP_READV in this respect. At least I don't see why polling would work differently: the writev should complete immediately and then the readv is queued. Same as what effectively happens with IORING_OP_URING_CMD, no? > Btw, another issue that is avoided with the new ring-request layout is > compatibility and alignment. The fuse header is always in a 4K section > of the request data follow then. I.e. extending request sizes does not > impose compatibility issues any more and also for passthrough and > similar - O_DIRECT can be passed through to the backend file system. > Although these issues probably need to be solved into the current fuse > protocol. Yes. > Last but not least, with separation there is no global async queue > anymore - no global lock and cache issues. The global async code should be moved into the /dev/fuse specific "legacy" queuing so it doesn't affect either uring or virtiofs queuing. > >> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/ > >> > >> This cache line bouncing should be addressed by these patches > >> as well. > > > > Why do you think this is solved? > > > I _guess_ that writing to the mmaped buffer and processing requests on > the same cpu core should make it possible to keep things in cpu cache. I > did not verify that with perf, though. Well, the issue is with any context switch that happens in the multithreaded fuse server process. Shared address spaces will have a common "which CPU this is currently running on" bitmap (mm->cpu_bitmap), which is updated whenever one of the threads using this address space gets scheduled or descheduled. Now imagine a fuse server running on a big numa system, which has threads bound to each CPU. The server tries to avoid using sharing data structures between threads, so that cache remains local. But it can't avoid updating this bitmap on schedule. The bitmap can pack 512 CPUs into a single cacheline, which means that thread locality is compromised. I'm somewhat surprised that this doesn't turn up in profiles in real life, but I guess it's not a big deal in most cases. I only observed it with a special "no-op" fuse server running on big numa and with per-thread queuing, etc. enabled (fuse2). > For sync requests getting the scheduler involved is what is responsible > for making really fuse slow. It schedules on random cores, that are in > sleep states and additionally frequency scaling does not go up. We > really need to stay on the same core here, as that is submitting the > result, the core is already running (i.e. not sleeping) and has data in > its cache. All benchmark results with sync requests point that out. No arguments about that. > For async requests, the above still applies, but basically one thread is > writing/reading and the other thread handles/provides the data. Random > switching of cores is then still not good. At best and to be tested, > submitting rather large chunks to other cores. > What is indeed to be discussed (and think annotated in the corresponding > patch description), if there is a way to figure out if the other core is > already busy. But then the scheduler does not know what it is busy with > - are these existing fuse requests or something else. That part is > really hard and I don't think it makes sense to discuss this right now > before the main part is merged. 
IMHO, better to add a config flag for > the current cpu+1 scheduling with an annotation that this setting might > go away in the future. The cpu + 1 seems pretty arbitrary, and could be a very bad decision if there are independent tasks bound to certain CPUs or if the target turns out to be on a very distant CPU. I'm not sure what the right answer is. It's probably something like: try to schedule this on a CPU which is not busy but is not very distant from this one. The scheduler can probably answer this question, but there's no current API for this. Thanks, Miklos
On 6/11/24 17:35, Miklos Szeredi wrote: > On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > >> Secondly, with IORING_OP_URING_CMD we already have only a single command >> to submit requests and fetch the next one - half of the system calls. >> >> Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach? >> https://github.com/uroni/fuseuring? >> I.e. it hook into the existing fuse and just changes from read()/write() >> of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero >> control which ring/queue and which ring-entry handles the request. > > Unlike system calls, io_uring ops should have very little overhead. > That's one of the main selling points of io_uring (as described in the > io_uring(7) man page). > > So I don't think it matters to performance whether there's a combined > WRITEV + READV (or COMMIT + FETCH) op or separate ops. This has to be performance proven and is no means what I'm seeing. How should io-uring improve performance if you have the same number of system calls? As I see it (@Jens or @Pavel or anyone else please correct me if I'm wrong), advantage of io-uring comes when there is no syscall overhead at all - either you have a ring with multiple entries and then one side operates on multiple entries or you have polling and no syscall overhead either. We cannot afford cpu intensive polling - out of question, besides that I had even tried SQPOLL and it made things worse (that is actually where my idea about application polling comes from). As I see it, for sync blocking calls (like meta operations) with one entry in the queue, you would get no advantage with IORING_OP_READV/IORING_OP_WRITEV - io-uring has do two system calls - one to submit from kernel to userspace and another from userspace to kernel. Why should io-uring be faster there? And from my testing this is exactly what I had seen - io-uring for meta requests (i.e. without a large request queue and *without* core affinity) makes meta operations even slower that /dev/fuse. For anything that imposes a large ring queue and where either side (kernel or userspace) needs to process multiple ring entries - system call overhead gets reduced by the queue size. Just for DIO or meta operations that is hard to reach. Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would change in fuse kernel? I.e. IOs would go via fuse_dev_read()? I.e. we would not have encoded in the request which queue it belongs to? > > The advantage of separate ops is more flexibility and less complexity > (do only one thing in an op) Did you look at patch 12/19? It just does fuse_uring_req_end_and_get_next(). That part isn't complex, imho. > >> Thirdly, initially I had even removed the allocation of 'struct >> fuse_req' and directly allocated these on available ring entries. I.e. >> the application thread was writing to the mmap ring buffer. I just >> removed that code for now, as it introduced additional complexity with >> an unknown performance outcome. If it should be helpful we could add >> that later. I don't think we have that flexibility with >> IORING_OP_READV/IORING_OP_WRITEV. > > I think I understand what you'd like to see in the end: basically a > reverse io_uring, where requests are placed on a "submission queue" by > the kernel and completed requests are placed on a completion queue by > the userspace. Exactly the opposite of current io_uring. 
> > The major difference between your idea of a fuse_uring and the > io_uring seems to be that you place not only the request on the shared > buffer, but the data as well. I don't think this is a good idea, > since it will often incur one more memory copy. Otherwise the idea > itself seems sound. Could you explain what you mean by "one more memory copy"? As it is right now, 'struct fuse_req' is always allocated as it was before and then a copy is done to the ring entry. No difference to legacy /dev/fuse IO, which also copies to the read buffer. If we avoided allocating 'struct fuse_req' when there are free ring entries, we would reduce copies, but never increase them. Btw, an advantage for the ring is on the libfuse side, where the fuse-request buffer is assigned to the CQE and as long as the request is not completed, the buffer is valid. (For /dev/fuse IO that could be solved in libfuse by decoupling request memory from the thread, but with the current ring buffer design that just happens more naturally and memory is limited by the queue size.) > > The implementation twisted due to having to integrate it with > io_uring. Unfortunately placing fuse requests directly into the > io_uring queue doesn't work, due to the reversal of roles and the size > difference between sqe and cqe entries. Also the shared buffer seems > to lose its ring aspect due to the fact that fuse doesn't get notified > when a request is taken off the queue (io_uring_cqe_seen(3)). > > So I think either better integration with io_uring is needed with > support for "reverse submission" or a new interface. Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And ublk_drv also works exactly that way. I had pointed it out before, initially I had considered writing a reverse io-uring myself and then exactly at that time ublk came up. The interface of that 'reverse io' to io-uring is really simple. 1) Userspace sends an IORING_OP_URING_CMD SQE 2) That CMD gets handled/queued by struct file_operations::uring_cmd / fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the request 3) When the fuse client has data to complete the request, it calls io_uring_cmd_done() and the fuse server receives a CQE with the fuse request. Personally I don't see anything twisted here, one just needs to understand that IORING_OP_URING_CMD was written for that reverse order. (A light twist came up when io-uring introduced issue_flags - that is part of the discussion of patch 19/19 with Jens in the series. Jens suggested working on io-uring improvements once the main series is merged. I.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask Jens for help once the other parts are merged. Right now that's easy to work around by always submitting with an io-uring task). Also, that simplicity is the reason why I'm hesitating a bit to work on Kent's new ring, as io-uring already has all that we need and with a rather simple interface. Well, maybe you mean patch 09/19 "Add a dev_release exception for fuse-over-io-uring". Yep, that is the shutdown part I'm not too happy about and which initially led to the async release thread in RFCv1. > >>> >>> Maybe there's an advantage in using an atomic op for WRITEV + READV, >>> but I'm not quite seeing it yet, since there's no syscall overhead for >>> separate ops. >> >> Here I get confused, could please explain?
>> Current fuse has a read + write operation - a read() system call to >> process a fuse request and a write() call to submit the result and then >> read() to fetch the next request. >> If userspace has to send IORING_OP_READV to fetch a request and complete >> with IORING_OP_IORING_OP_WRITEV it would go through existing code path >> with operations? Well, maybe userspace could submit with IOSQE_IO_LINK, >> but that sounds like it would need to send two ring entries? I.e. memory >> and processing overhead? > > Overhead should be minimal. See above, for single entry blocking requests you get two system calls + io-uring overhead. > >> And then, no way to further optimize and do fuse_req allocation on the >> ring (if there are free entries). And probably also no way that we ever >> let the application work in the SQPOLL way, because the application >> thread does not have the right to read from the fuse-server buffer? I.e. >> what I mean is that IORING_OP_URING_CMD gives a better flexibility. > > There should be no difference between IORING_OP_URING_CMD and > IORING_OP_WRITEV + IORING_OP_READV in this respect. At least I don't > see why polling would work differently: the writev should complete > immediately and then the readv is queued. Same as what effectively > happens with IORING_OP_URING_CMD, no? Polling yes, but without shared memory the application thread does not have the right to read from the fuse userspace server's request buffer? > >> Btw, another issue that is avoided with the new ring-request layout is >> compatibility and alignment. The fuse header is always in a 4K section >> of the request data follow then. I.e. extending request sizes does not >> impose compatibility issues any more and also for passthrough and >> similar - O_DIRECT can be passed through to the backend file system. >> Although these issues probably need to be solved into the current fuse >> protocol. > > Yes. > >> Last but not least, with separation there is no global async queue >> anymore - no global lock and cache issues. > > The global async code should be moved into the /dev/fuse specific > "legacy" queuing so it doesn't affect either uring or virtiofs > queuing. Yep, wait a few days - I have seen your recent patch and I may add that to my series. I actually considered pointing out that the bg queue could be handled by that series as well, but then preferred to just add a patch for that in my series, which will make use of it for the ring queue. > >>>> https://lore.kernel.org/linux-fsdevel/CAJfpegtL3NXPNgK1kuJR8kLu3WkVC_ErBPRfToLEiA_0=w3=hA@mail.gmail.com/ >>>> >>>> This cache line bouncing should be addressed by these patches >>>> as well. >>> >>> Why do you think this is solved? >> >> >> I _guess_ that writing to the mmaped buffer and processing requests on >> the same cpu core should make it possible to keep things in cpu cache. I >> did not verify that with perf, though. > > Well, the issue is with any context switch that happens in the > multithreaded fuse server process. Shared address spaces will have a > common "which CPU this is currently running on" bitmap > (mm->cpu_bitmap), which is updated whenever one of the threads using > this address space gets scheduled or descheduled. > > Now imagine a fuse server running on a big numa system, which has > threads bound to each CPU. The server tries to avoid using sharing > data structures between threads, so that cache remains local. But it > can't avoid updating this bitmap on schedule.
The bitmap can pack 512 > CPUs into a single cacheline, which means that thread locality is > compromised. To be honest, I wonder how you worked around scheduler issues on waking up the application thread. Did you core bind application threads as well (I mean besides fuse server threads)? We now have this (unexported) wake_on_current_cpu. Last year that still wasn't working perfectly well and Hillf Danton has suggested the 'seesaw' approach. And with that the scheduler was working very well. You could get the same with application core binding, but with 512 CPUs that is certainly not done manually anymore. Did you use a script to bind application threads or did you core bind from within the application? > > I'm somewhat surprised that this doesn't turn up in profiles in real > life, but I guess it's not a big deal in most cases. I only observed > it with a special "no-op" fuse server running on big numa and with > per-thread queuing, etc. enabled (fuse2). Ok, I'm testing only with 32 cores and two numa nodes. For final benchmarking I could try to get a more recent AMD based system with 96 cores. I don't think we have anything near 512 CPUs in the lab. I'm not aware of such customer systems either. > >> For sync requests getting the scheduler involved is what is responsible >> for making really fuse slow. It schedules on random cores, that are in >> sleep states and additionally frequency scaling does not go up. We >> really need to stay on the same core here, as that is submitting the >> result, the core is already running (i.e. not sleeping) and has data in >> its cache. All benchmark results with sync requests point that out. > > No arguments about that. > >> For async requests, the above still applies, but basically one thread is >> writing/reading and the other thread handles/provides the data. Random >> switching of cores is then still not good. At best and to be tested, >> submitting rather large chunks to other cores. >> What is indeed to be discussed (and think annotated in the corresponding >> patch description), if there is a way to figure out if the other core is >> already busy. But then the scheduler does not know what it is busy with >> - are these existing fuse requests or something else. That part is >> really hard and I don't think it makes sense to discuss this right now >> before the main part is merged. IMHO, better to add a config flag for >> the current cpu+1 scheduling with an annotation that this setting might >> go away in the future. > > The cpu + 1 seems pretty arbitrary, and could be a very bad decision > if there are independent tasks bound to certain CPUs or if the target > turns out to be on a very distant CPU. > > I'm not sure what the right answer is. It's probably something like: > try to schedule this on a CPU which is not busy but is not very > distant from this one. The scheduler can probably answer this > question, but there's no current API for this. This is just another optimization. What you write is true and I was aware of that. It was just a rather simple optimization that improved results - enough to demo it. We can work with scheduler people in the future on that or we add a bit of our own logic and create mapping of cpu -> next-cpu-on-same-numa-node. Certainly an API and help from the scheduler would be preferred. Thanks, Bernd
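[As a concrete illustration of the three-step "reverse" flow Bernd describes in the mail above, here is a minimal userspace-side sketch using liburing. The kernel-side counterpart (fuse_uring_cmd() returning -EIOCBQUEUED and later calling io_uring_cmd_done()) is what the RFC implements; the command opcode value and the cmd payload contents below are made-up placeholders for illustration, not the RFC's ABI.]

/* Userspace sketch: queue one IORING_OP_URING_CMD per ring entry, the
 * kernel parks it and completes it via io_uring_cmd_done() once a fuse
 * request is available, so the CQE announces the next request in the
 * shared buffer.  Error handling is mostly omitted. */
#include <liburing.h>
#include <stdio.h>
#include <string.h>

#define FUSE_URING_REQ_FETCH	1	/* placeholder, not the real value */

static int queue_fetch(struct io_uring *ring, int fuse_dev_fd, int qid)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	if (!sqe)
		return -1;
	/* No buffer in the SQE itself; the mmap'ed ring entry is identified
	 * through the inline command payload (layout is an assumption). */
	io_uring_prep_rw(IORING_OP_URING_CMD, sqe, fuse_dev_fd, NULL, 0, 0);
	sqe->cmd_op = FUSE_URING_REQ_FETCH;
	memcpy(sqe->cmd, &qid, sizeof(qid)); /* larger payloads need IORING_SETUP_SQE128 */
	io_uring_sqe_set_data(sqe, (void *)(long)qid);
	return 0;
}

int main(void)
{
	struct io_uring_params p = { .flags = IORING_SETUP_SQE128 };
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	int fuse_dev_fd = -1;	/* would come from the fuse mount / device clone */

	if (io_uring_queue_init_params(8, &ring, &p) < 0)
		return 1;
	queue_fetch(&ring, fuse_dev_fd, 0);
	io_uring_submit(&ring);			/* step 1: send the SQE */
	io_uring_wait_cqe(&ring, &cqe);		/* step 3: CQE carries the fuse request */
	printf("fetch completed, res=%d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}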
On Tue, Jun 11, 2024 at 07:37:30PM GMT, Bernd Schubert wrote: > > > On 6/11/24 17:35, Miklos Szeredi wrote: > > On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > > >> Secondly, with IORING_OP_URING_CMD we already have only a single command > >> to submit requests and fetch the next one - half of the system calls. > >> > >> Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach? > >> https://github.com/uroni/fuseuring? > >> I.e. it hook into the existing fuse and just changes from read()/write() > >> of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero > >> control which ring/queue and which ring-entry handles the request. > > > > Unlike system calls, io_uring ops should have very little overhead. > > That's one of the main selling points of io_uring (as described in the > > io_uring(7) man page). > > > > So I don't think it matters to performance whether there's a combined > > WRITEV + READV (or COMMIT + FETCH) op or separate ops. > > This has to be performance proven and is no means what I'm seeing. How > should io-uring improve performance if you have the same number of > system calls? > > As I see it (@Jens or @Pavel or anyone else please correct me if I'm > wrong), advantage of io-uring comes when there is no syscall overhead at > all - either you have a ring with multiple entries and then one side > operates on multiple entries or you have polling and no syscall overhead > either. We cannot afford cpu intensive polling - out of question, > besides that I had even tried SQPOLL and it made things worse (that is > actually where my idea about application polling comes from). > As I see it, for sync blocking calls (like meta operations) with one > entry in the queue, you would get no advantage with > IORING_OP_READV/IORING_OP_WRITEV - io-uring has do two system calls - > one to submit from kernel to userspace and another from userspace to > kernel. Why should io-uring be faster there? > > And from my testing this is exactly what I had seen - io-uring for meta > requests (i.e. without a large request queue and *without* core > affinity) makes meta operations even slower that /dev/fuse. > > For anything that imposes a large ring queue and where either side > (kernel or userspace) needs to process multiple ring entries - system > call overhead gets reduced by the queue size. Just for DIO or meta > operations that is hard to reach. > > Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would > change in fuse kernel? I.e. IOs would go via fuse_dev_read()? > I.e. we would not have encoded in the request which queue it belongs to? Want to try out my new ringbuffer syscall? I haven't yet dug far into the fuse protocol or /dev/fuse code yet, only skimmed. But using it to replace the read/write syscall overhead should be straightforward; you'll want to spin up a kthread for responding to requests. The next thing I was going to look at is how you guys are using splice, we want to get away from that too. Brian was also saying the fuse virtio_fs code may be worth investigating, maybe that could be adapted?
On Tue, 11 Jun 2024 at 19:37, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > > So I don't think it matters to performance whether there's a combined > > WRITEV + READV (or COMMIT + FETCH) op or separate ops. > > This has to be performance proven and is no means what I'm seeing. How > should io-uring improve performance if you have the same number of > system calls? The ops can be queued together and submitted together. Two separate (but possibly linked) ops should result in exactly the same number of syscalls as a single combined op. > Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would > change in fuse kernel? I.e. IOs would go via fuse_dev_read()? > I.e. we would not have encoded in the request which queue it belongs to? The original idea was to use the cloned /dev/fuse fd to sort requests into separate queues. That was only half finished: the input queue is currently shared by all clones, but once a request is read by the server from a particular clone it is put into a separate processing queue. Adding separate input queues to each clone should also be possible. I'm not saying this is definitely the direction we should be taking, but it's something to consider. > > The advantage of separate ops is more flexibility and less complexity > > (do only one thing in an op) > > Did you look at patch 12/19? It just does > fuse_uring_req_end_and_get_next(). That part isn't complex, imho. That function name indicates that this is too complex: it is doing two independent things (ending one request and fetching the next). Fine if it's a valid optimization, but I'm saying that it likely isn't. > > The major difference between your idea of a fuse_uring and the > > io_uring seems to be that you place not only the request on the shared > > buffer, but the data as well. I don't think this is a good idea, > > since it will often incur one more memory copy. Otherwise the idea > > itself seems sound. > > Coud you explain what you mean with "one more memory copy"? If the filesystem is providing the result of a READ request as a pointer to a buffer (which can be the case with fuse_reply_data()), then that buffer will need to be copied to the shared buffer, and from the shared buffer to the read destination. That first copy is unnecessary if the kernel receives the pointer to the userspace buffer and copies the data directly to the destination. > > So I think either better integration with io_uring is needed with > > support for "reverse submission" or a new interface. > > Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And > ublk_drv also works exactly that way. I had pointed it out before, > initially I had considered to write a reverse io-uring myself and then > exactly at that time ublk came up. I'm just looking for answers why this architecture is the best. Maybe it is, but I find it too complex and can't explain why it's going to perform better than a much simpler single ring model. > The interface of that 'reverse io' to io-uring is really simple. > > 1) Userspace sends a IORING_OP_URING_CMD SQE > 2) That CMD gets handled/queued by struct file_operations::uring_cmd / > fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the > request > 3) When fuse client has data to complete the request, it calls > io_uring_cmd_done() and fuse server receives a CQE with the fuse request. > > Personally I don't see anything twisted here, one just needs to > understand that IORING_OP_URING_CMD was written for that reverse order. That's just my gut feeling. 
fuse/dev_uring.c is 1233 in this RFC. And that's just the queuing. > (There came up a light twisting when io-uring introduced issue_flags - > that is part of discussion of patch 19/19 with Jens in the series. Jens > suggested to work on io-uring improvements once the main series is > merged. I.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask > Jens for help once the other parts are merged. Right now that easy to > work around by always submitting with an io-uring task). > > Also, that simplicity is the reason why I'm hesitating a bit to work on > Kents new ring, as io-uring already has all what we need and with a > rather simple interface. I'm in favor of using io_uring, if possible. I'm also in favor of a single shared buffer (ring) if possible. Using cloned fd + plain READV / WRITEV ops is one possibility. But I'm not opposed to IORING_OP_URING_CMD either. Btw, fuse reply could be inlined in the majority of cases into that 80 byte free space in the sqe. Also might consider an extended cqe mode, where short fuse request could be inlined as well (e.g. IORING_SETUP_CQE128 -> 112 byte payload). > To be honest, I wonder how you worked around scheduler issues on waking > up the application thread. Did you core bind application threads as well > (I mean besides fuse server threads)? We now have this (unexported) > wake_on_current_cpu. Last year that still wasn't working perfectly well > and Hillf Danton has suggested the 'seesaw' approach. And with that the > scheduler was working very well. You could get the same with application > core binding, but with 512 CPUs that is certainly not done manually > anymore. Did you use a script to bind application threads or did you > core bind from within the application? Probably, I don't remember anymore. Thanks, Miklos
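[To make the "queued together and submitted together" point concrete, here is a hedged liburing sketch of the alternative Miklos describes above: a reply (writev) hard-linked to the fetch of the next request (readv) on a /dev/fuse style fd, both handed to the kernel with a single io_uring_enter(). Buffer management and fd setup are simplified assumptions, not existing fuse code.]

/* Sketch of the COMMIT + FETCH alternative using plain ops: link the
 * writev of the previous reply to the readv of the next request and
 * submit both SQEs with one syscall.  Error handling omitted. */
#include <liburing.h>
#include <sys/uio.h>

static int commit_and_fetch(struct io_uring *ring, int fuse_fd,
			    struct iovec *reply, struct iovec *next_buf)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);		/* reply to the previous request */
	io_uring_prep_writev(sqe, fuse_fd, reply, 1, 0);
	sqe->flags |= IOSQE_IO_LINK;		/* only then ... */

	sqe = io_uring_get_sqe(ring);		/* ... fetch the next one */
	io_uring_prep_readv(sqe, fuse_fd, next_buf, 1, 0);

	/* Both SQEs reach the kernel in a single io_uring_enter(). */
	return io_uring_submit(ring);
}

[Whether this keeps enough control over which queue and ring entry serves a request - the point raised in the reply below - is what distinguishes it from the IORING_OP_URING_CMD design.]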
On 6/12/24 09:39, Miklos Szeredi wrote: > On Tue, 11 Jun 2024 at 19:37, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: > >>> So I don't think it matters to performance whether there's a combined >>> WRITEV + READV (or COMMIT + FETCH) op or separate ops. >> >> This has to be performance proven and is no means what I'm seeing. How >> should io-uring improve performance if you have the same number of >> system calls? > > The ops can be queued together and submitted together. Two separate > (but possibly linked) ops should result in exactly the same number of > syscalls as a single combined op. As I wrote before, that requires the double amount of queue buffer memory. Goes totally into the opposite direction of what I'm currently working on - to use less memory, as not all requests need 1MB buffers and as we want to use as little as possible memory. > >> Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would >> change in fuse kernel? I.e. IOs would go via fuse_dev_read()? >> I.e. we would not have encoded in the request which queue it belongs to? > > The original idea was to use the cloned /dev/fuse fd to sort requests > into separate queues. That was only half finished: the input queue is > currently shared by all clones, but once a request is read by the > server from a particular clone it is put into a separate processing > queue. Adding separate input queues to each clone should also be > possible. > > I'm not saying this is definitely the direction we should be taking, > but it's something to consider. I was considering to use that for the mmap method, but then found it easier to just add per numa to an rb-tree. Well, maybe I should reconsider, as the current patch series clones the device anyway. > >>> The advantage of separate ops is more flexibility and less complexity >>> (do only one thing in an op) >> >> Did you look at patch 12/19? It just does >> fuse_uring_req_end_and_get_next(). That part isn't complex, imho. > > That function name indicates that this is too complex: it is doing two > independent things (ending one request and fetching the next). > > Fine if it's a valid optimization, but I'm saying that it likely isn't. Would it help, if it would be in two lines? Like fuse_uring_req_end(); fuse_uring_req_next(); It has to check if there are requests queued, as it goes on hold if not and then won't process the queue. > >>> The major difference between your idea of a fuse_uring and the >>> io_uring seems to be that you place not only the request on the shared >>> buffer, but the data as well. I don't think this is a good idea, >>> since it will often incur one more memory copy. Otherwise the idea >>> itself seems sound. >> >> Coud you explain what you mean with "one more memory copy"? > > If the filesystem is providing the result of a READ request as a > pointer to a buffer (which can be the case with fuse_reply_data()), > then that buffer will need to be copied to the shared buffer, and from > the shared buffer to the read destination. > > That first copy is unnecessary if the kernel receives the pointer to > the userspace buffer and copies the data directly to the destination. I didn't do that yet, as we are going to use the ring buffer for requests, i.e. the ring buffer immediately gets all the data from network, there is no copy. Even if the ring buffer would get data from local disk - there is no need to use a separate application buffer anymore. 
And with that there is just no extra copy. Keep in mind that the ring buffers are coupled with the request and not the processing thread as in current libfuse - the buffer is valid as long as the request is not completed. That part is probably harder with IORING_OP_READV/IORING_OP_WRITEV, especially if you need two of them. Your idea sounds useful if userspace had its own cache outside of the ring buffers, and that could be added in as another optimization, something like: - fuse-ring-req gets a user pointer - flag if the user pointer is set - kernel acts on the flag - completion gets sent to userspace by: - if the next request is already in the queue: piggy-back the completion into the next request - if no next request: send a separate completion message If you want, I could add that as another optimization patch to the next RFC. Maybe that is also useful for existing libfuse applications that are used to the fact that the request buffer is associated with the thread and not the request itself. If we want real zero copy, we can add in Ming's work on that. I'm going to look into that today or tomorrow. > >>> So I think either better integration with io_uring is needed with >>> support for "reverse submission" or a new interface. >> >> Well, that is exactly what IORING_OP_URING_CMD is for, afaik. And >> ublk_drv also works exactly that way. I had pointed it out before, >> initially I had considered to write a reverse io-uring myself and then >> exactly at that time ublk came up. > > I'm just looking for answers why this architecture is the best. Maybe > it is, but I find it too complex and can't explain why it's going to > perform better than a much simpler single ring model. > >> The interface of that 'reverse io' to io-uring is really simple. >> >> 1) Userspace sends a IORING_OP_URING_CMD SQE >> 2) That CMD gets handled/queued by struct file_operations::uring_cmd / >> fuse_uring_cmd(). fuse_uring_cmd() returns -EIOCBQUEUED and queues the >> request >> 3) When fuse client has data to complete the request, it calls >> io_uring_cmd_done() and fuse server receives a CQE with the fuse request. >> >> Personally I don't see anything twisted here, one just needs to >> understand that IORING_OP_URING_CMD was written for that reverse order. > > That's just my gut feeling. fuse/dev_uring.c is 1233 in this RFC. > And that's just the queuing. Dunno, from my point of view the main logic is so much simpler than what fuse_dev_do_read() has - that function checks multiple queues, takes multiple locks, has to add the fuse-req to a (now hashed) list and has a restart loop. If possible, I really wouldn't like to make that even more complex. But we want to add to it: - Selecting different ring entries based on their io-size (what I'm currently adding in, to reduce memory per queue). I don't think that would even be possible with the current wake logic. - Add in credits for different IO types to avoid that one IO type can fill the entire queue. Right now the ring only has a separation of sync/async, but I don't think that is enough. To compare, could you please check the code flow of FUSE_URING_REQ_COMMIT_AND_FETCH? I have no issue with splitting fuse_uring_req_end_and_get_next() into two functions. What I mean is that the code flow is hopefully not hard to follow: it ends the request and then puts the entry on the avail list. Then it checks if there is a pending fuse request and handles that; from my point of view that is much easier to follow and has fewer conditions than what fuse_dev_do_read() has.
And that even though there is already a separation of sync and async queues. > >> (There came up a light twisting when io-uring introduced issue_flags - >> that is part of discussion of patch 19/19 with Jens in the series. Jens >> suggested to work on io-uring improvements once the main series is >> merged. I.e. patch 19/19 will be dropped in RFCv3 and I'm going to ask >> Jens for help once the other parts are merged. Right now that easy to >> work around by always submitting with an io-uring task). >> >> Also, that simplicity is the reason why I'm hesitating a bit to work on >> Kents new ring, as io-uring already has all what we need and with a >> rather simple interface. > > I'm in favor of using io_uring, if possible. I have no objections, but I would also like to see an RHEL version with it... > > I'm also in favor of a single shared buffer (ring) if possible. Using > cloned fd + plain READV / WRITEV ops is one possibility. Cons: - READV / WRITEV would need to be coupled in order to avoid two io-uring cmd-enter system calls - double the amount of memory - not impossible, but harder to achieve that the request buffer belongs to the request itself - request data section and fuse header are not clearly separated, data alignment compat issues - different IO sizes are hard to impossible to support - with large queue sizes, high memory usage, even if the majority only needs small requests - request type credits are much harder to achieve - the vfs application cannot directly write into the ring buffer, which reduces future optimizations - the new zero copy approach Ming Lei is working on cannot be used - not very flexible for future additions, IMHO Pros: - probably fewer code additions - no shutdown issues - existing splice works > > But I'm not opposed to IORING_OP_URING_CMD either. Btw, fuse reply > could be inlined in the majority of cases into that 80 byte free space > in the sqe. Also might consider an extended cqe mode, where short > fuse request could be inlined as well (e.g. IORING_SETUP_CQE128 -> 112 > byte payload). That conflicts with wanting the fuse header in a separate section to avoid alignment and compat issues. In fact, we might even want to make that header section depend on the system page size. And then what would be the advantage - we have the buffer anyway? Thanks, Bernd PS: What I definitely realize is that at LSFMM2023 I should have talked about why I had taken that approach, and should have cut down the slides about architecture and performance.
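To make the 'reverse order' flow quoted above more concrete, here is a minimal, hypothetical sketch of the kernel side. Only IORING_OP_URING_CMD, the -EIOCBQUEUED return and io_uring_cmd_done() are the real io_uring pieces (exact io_uring_cmd_done() arguments differ between kernel versions); struct fuse_ring_ent and the fuse_uring_* helpers are made-up names for illustration, not the code from the patch series.

/* hypothetical sketch, not the actual patch series code */
struct fuse_ring_ent {
    struct io_uring_cmd *cmd;   /* parked SQE waiting for a fuse request */
    void *buf;                  /* request/reply buffer of this entry */
};

/* 1) + 2): the fuse server submits IORING_OP_URING_CMD; the handler parks
 * the ring entry and tells io_uring that the CQE will be posted later.
 */
static int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
{
    struct fuse_ring_ent *ent = fuse_uring_lookup_entry(cmd);  /* assumed helper */

    ent->cmd = cmd;
    fuse_uring_entry_available(ent);    /* assumed: entry may now take requests */
    return -EIOCBQUEUED;
}

/* 3): when a fuse request is queued to this entry, it is copied into the
 * entry's buffer and the command is completed; the server receives a CQE.
 */
static void fuse_uring_queue_req(struct fuse_ring_ent *ent, struct fuse_req *req)
{
    fuse_uring_copy_to_buf(ent->buf, req);  /* assumed helper */
    io_uring_cmd_done(ent->cmd, 0, 0, 0);
}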
On 6/12/24 15:32, Bernd Schubert wrote: > On 6/12/24 09:39, Miklos Szeredi wrote: >>> Personally I don't see anything twisted here, one just needs to >>> understand that IORING_OP_URING_CMD was written for that reverse order. >> >> That's just my gut feeling. fuse/dev_uring.c is 1233 in this RFC. >> And that's just the queuing. > Btw, counting lines, the majority of that is not queuing and handling of requests, but setting things up, shutdown (start/stop is already almost half of the file) and doing sanity checks, as in fuse_uring_get_verify_queue().
On 6/12/24 01:35, Kent Overstreet wrote: > On Tue, Jun 11, 2024 at 07:37:30PM GMT, Bernd Schubert wrote: >> >> >> On 6/11/24 17:35, Miklos Szeredi wrote: >>> On Tue, 11 Jun 2024 at 12:26, Bernd Schubert <bernd.schubert@fastmail.fm> wrote: >>> >>>> Secondly, with IORING_OP_URING_CMD we already have only a single command >>>> to submit requests and fetch the next one - half of the system calls. >>>> >>>> Wouldn't IORING_OP_READV/IORING_OP_WRITEV be just this approach? >>>> https://github.com/uroni/fuseuring? >>>> I.e. it hook into the existing fuse and just changes from read()/write() >>>> of /dev/fuse to io-uring of /dev/fuse. With the disadvantage of zero >>>> control which ring/queue and which ring-entry handles the request. >>> >>> Unlike system calls, io_uring ops should have very little overhead. >>> That's one of the main selling points of io_uring (as described in the >>> io_uring(7) man page). >>> >>> So I don't think it matters to performance whether there's a combined >>> WRITEV + READV (or COMMIT + FETCH) op or separate ops. >> >> This has to be performance proven and is no means what I'm seeing. How >> should io-uring improve performance if you have the same number of >> system calls? >> >> As I see it (@Jens or @Pavel or anyone else please correct me if I'm >> wrong), advantage of io-uring comes when there is no syscall overhead at >> all - either you have a ring with multiple entries and then one side >> operates on multiple entries or you have polling and no syscall overhead >> either. We cannot afford cpu intensive polling - out of question, >> besides that I had even tried SQPOLL and it made things worse (that is >> actually where my idea about application polling comes from). >> As I see it, for sync blocking calls (like meta operations) with one >> entry in the queue, you would get no advantage with >> IORING_OP_READV/IORING_OP_WRITEV - io-uring has do two system calls - >> one to submit from kernel to userspace and another from userspace to >> kernel. Why should io-uring be faster there? >> >> And from my testing this is exactly what I had seen - io-uring for meta >> requests (i.e. without a large request queue and *without* core >> affinity) makes meta operations even slower that /dev/fuse. >> >> For anything that imposes a large ring queue and where either side >> (kernel or userspace) needs to process multiple ring entries - system >> call overhead gets reduced by the queue size. Just for DIO or meta >> operations that is hard to reach. >> >> Also, if you are using IORING_OP_READV/IORING_OP_WRITEV, nothing would >> change in fuse kernel? I.e. IOs would go via fuse_dev_read()? >> I.e. we would not have encoded in the request which queue it belongs to? > > Want to try out my new ringbuffer syscall? > > I haven't yet dug far into the fuse protocol or /dev/fuse code yet, only > skimmed. But using it to replace the read/write syscall overhead should > be straightforward; you'll want to spin up a kthread for responding to > requests. I will definitely look at it this week. Although I don't like the idea to have a new kthread. We already have an application thread and have the fuse server thread, why do we need another one? > > The next thing I was going to look at is how you guys are using splice, > we want to get away from that too. Well, Ming Lei is working on that for ublk_drv and I guess that new approach could be adapted as well onto the current way of io-uring. It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. 
https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ > > Brian was also saying the fuse virtio_fs code may be worth > investigating, maybe that could be adapted? I need to check, but really, the majority of the new additions is just setting things up, shutdown and sanity checks. Request sending/completion to/from the ring does not add that many new lines. Thanks, Bernd
On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@ddn.com> wrote: > I didn't do that yet, as we are going to use the ring buffer for requests, > i.e. the ring buffer immediately gets all the data from network, there is > no copy. Even if the ring buffer would get data from local disk - there > is no need to use a separate application buffer anymore. And with that > there is just no extra copy Let's just tackle this shared request buffer, as it seems to be a central part of your design. You say the shared buffer is used to immediately get the data from the network (or various other sources), which is completely viable. And then the kernel will do the copy from the shared buffer. Single copy, fine. But if the buffer wasn't shared? What would be the difference? Single copy also. Why is the shared buffer better? I mean it may even be worse due to cache aliasing issues on certain architectures. copy_to_user() / copy_from_user() are pretty darn efficient. Why is it better to have that buffer managed by kernel? Being locked in memory (being unswappable) is probably a disadvantage as well. And if locking is required, it can be done on the user buffer. And there are all the setup and teardown complexities... Note: the ring buffer used by io_uring is different. It literally allows communication without invoking any system calls in certain cases. That shared buffer doesn't add anything like that. At least I don't see what it actually adds. Hmm? Thanks, Miklos
On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote: > I will definitely look at it this week. Although I don't like the idea > to have a new kthread. We already have an application thread and have > the fuse server thread, why do we need another one? Ok, I hadn't found the fuse server thread - that should be fine. > > > > The next thing I was going to look at is how you guys are using splice, > > we want to get away from that too. > > Well, Ming Lei is working on that for ublk_drv and I guess that new approach > could be adapted as well onto the current way of io-uring. > It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. > > https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ > > > > > Brian was also saying the fuse virtio_fs code may be worth > > investigating, maybe that could be adapted? > > I need to check, but really, the majority of the new additions > is just to set up things, shutdown and to have sanity checks. > Request sending/completing to/from the ring is not that much new lines. What I'm wondering is how read/write requests are handled. Are the data payloads going in the same ringbuffer as the commands? That could work, if the ringbuffer is appropriately sized, but alignment is an issue. We just looked up the device DMA requirements and with modern NVME only 4 byte alignment is required, but the block layer likely isn't set up to handle that. So - prearranged buffer? Or are you using splice to get pages that userspace has read into the kernel pagecache?
On 6/12/24 16:07, Miklos Szeredi wrote: > On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@ddn.com> wrote: > >> I didn't do that yet, as we are going to use the ring buffer for requests, >> i.e. the ring buffer immediately gets all the data from network, there is >> no copy. Even if the ring buffer would get data from local disk - there >> is no need to use a separate application buffer anymore. And with that >> there is just no extra copy > > Let's just tackle this shared request buffer, as it seems to be a > central part of your design. > > You say the shared buffer is used to immediately get the data from the > network (or various other sources), which is completely viable. > > And then the kernel will do the copy from the shared buffer. Single copy, fine. > > But if the buffer wasn't shared? What would be the difference? > Single copy also. > > Why is the shared buffer better? I mean it may even be worse due to > cache aliasing issues on certain architectures. copy_to_user() / > copy_from_user() are pretty darn efficient. Right now we have: - the application thread writes into the buffer, then calls io_uring_cmd_done. I can try to do without mmap and set a pointer to the user buffer in the 80B section of the SQE. I'm not sure if the application is allowed to write into that buffer; possibly/probably we will be forced to use io_uring_cmd_complete_in_task() in all cases (without 19/19 we have that anyway). My greatest fear here is that the extra task has performance implications for sync requests. > > Why is it better to have that buffer managed by kernel? Being locked > in memory (being unswappable) is probably a disadvantage as well. And > if locking is required, it can be done on the user buffer. Well, let me try passing the buffer in the 80B section. > > And there are all the setup and teardown complexities... If the buffer in the 80B section works, setup becomes easier: mmap and ioctls go away. Teardown, well, we still need the workaround as we need to handle io_uring_cmd_done, but if you could live with that for now, I would ask Jens or Pavel or Ming for help to see if we could solve that in io-uring itself. Is the ring workaround in fuse_dev_release() acceptable for you? Or do you have another idea for it? > > Note: the ring buffer used by io_uring is different. It literally > allows communication without invoking any system calls in certain > cases. That shared buffer doesn't add anything like that. At least I > don't see what it actually adds. > > Hmm? The application can write into the buffer. We wouldn't need shared queue buffers if we could solve the same with a user pointer. Thanks, Bernd
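As a reference for the 'pointer in the 80B section' idea, a rough userspace sketch of how the fuse server could pass its own buffer in the SQE128 command area; struct fuse_uring_cmd_req and the cmd_op value are assumptions, only IORING_OP_URING_CMD and the SQE128 command area are given by io_uring/liburing.

#include <liburing.h>
#include <stdint.h>

/* assumed layout of the 80-byte command area, not a final ABI */
struct fuse_uring_cmd_req {
    __u64 buf_ptr;      /* server-owned request/reply buffer */
    __u32 buf_len;
    __u32 qid;          /* queue / ring entry id */
};

/* queue one fetch command; the ring must be created with IORING_SETUP_SQE128 */
static int fuse_uring_queue_fetch(struct io_uring *ring, int fuse_fd,
                                  void *buf, unsigned int len, unsigned int qid)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    struct fuse_uring_cmd_req *req;

    if (!sqe)
        return -EAGAIN;

    io_uring_prep_rw(IORING_OP_URING_CMD, sqe, fuse_fd, NULL, 0, 0);
    sqe->cmd_op = 1;    /* e.g. FUSE_URING_REQ_FETCH, value made up */

    req = (struct fuse_uring_cmd_req *)sqe->cmd;
    req->buf_ptr = (__u64)(uintptr_t)buf;
    req->buf_len = len;
    req->qid = qid;
    return 0;
}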
On 6/12/24 16:19, Kent Overstreet wrote: > On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote: >> I will definitely look at it this week. Although I don't like the idea >> to have a new kthread. We already have an application thread and have >> the fuse server thread, why do we need another one? > > Ok, I hadn't found the fuse server thread - that should be fine. > >>> >>> The next thing I was going to look at is how you guys are using splice, >>> we want to get away from that too. >> >> Well, Ming Lei is working on that for ublk_drv and I guess that new approach >> could be adapted as well onto the current way of io-uring. >> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. >> >> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ >> >>> >>> Brian was also saying the fuse virtio_fs code may be worth >>> investigating, maybe that could be adapted? >> >> I need to check, but really, the majority of the new additions >> is just to set up things, shutdown and to have sanity checks. >> Request sending/completing to/from the ring is not that much new lines. > > What I'm wondering is how read/write requests are handled. Are the data > payloads going in the same ringbuffer as the commands? That could work, > if the ringbuffer is appropriately sized, but alignment is a an issue. That is exactly the big discussion Miklos and I are having. Basically in my series another buffer is vmalloced, mmaped and then assigned to ring entries. Fuse meta headers and application payload go into that buffer, in both kernel/userspace directions. io-uring only allows 80B, so only a really small request would fit into it. Legacy /dev/fuse has an alignment issue as the payload follows directly after the fuse header - intrinsically fixed in the ring patches. I will now try without mmap and just provide a user buffer as a pointer in the 80B section. > > We just looked up the device DMA requirements and with modern NVME only > 4 byte alignment is required, but the block layer likely isn't set up to > handle that. I think existing fuse headers and their data have a 4 byte alignment. Maybe even 8 byte, I don't remember without looking through all request types. If you try a simple O_DIRECT read/write to libfuse/example_passthrough_hp without the ring patches it will fail because of alignment. That needs to be fixed in legacy fuse and would also avoid compat issues we had in libfuse when the kernel header was updated. > > So - prearranged buffer? Or are you using splice to get pages that > userspace has read into into the kernel pagecache? I didn't even try to use splice yet, because for the DDN (my employer) use case we cannot use zero copy, at least not without violating the rule that one cannot access the application buffer in userspace. I will definitely look into Ming's work, as it will be useful for others. Cheers, Bernd
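For comparison, the mmap-based variant described above could look roughly like the following on the kernel side; fuse_ring_queue and the lookup helper are hypothetical, while remap_vmalloc_range()/vmalloc_user() are the existing kernel interfaces usually used for this pattern.

/* sketch: one vmalloc'ed per-queue buffer, mapped into the fuse server */
static int fuse_uring_mmap(struct file *file, struct vm_area_struct *vma)
{
    struct fuse_ring_queue *queue = fuse_uring_queue_from_vma(file, vma);   /* assumed */
    size_t size = vma->vm_end - vma->vm_start;

    if (!queue || size > queue->buf_size)
        return -EINVAL;

    /* queue->buf must have been allocated with vmalloc_user() at setup time */
    return remap_vmalloc_range(vma, queue->buf, vma->vm_pgoff);
}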
On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote: > On 6/12/24 16:19, Kent Overstreet wrote: > > On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote: > >> I will definitely look at it this week. Although I don't like the idea > >> to have a new kthread. We already have an application thread and have > >> the fuse server thread, why do we need another one? > > > > Ok, I hadn't found the fuse server thread - that should be fine. > > > >>> > >>> The next thing I was going to look at is how you guys are using splice, > >>> we want to get away from that too. > >> > >> Well, Ming Lei is working on that for ublk_drv and I guess that new approach > >> could be adapted as well onto the current way of io-uring. > >> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. > >> > >> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ > >> > >>> > >>> Brian was also saying the fuse virtio_fs code may be worth > >>> investigating, maybe that could be adapted? > >> > >> I need to check, but really, the majority of the new additions > >> is just to set up things, shutdown and to have sanity checks. > >> Request sending/completing to/from the ring is not that much new lines. > > > > What I'm wondering is how read/write requests are handled. Are the data > > payloads going in the same ringbuffer as the commands? That could work, > > if the ringbuffer is appropriately sized, but alignment is a an issue. > > That is exactly the big discussion Miklos and I have. Basically in my > series another buffer is vmalloced, mmaped and then assigned to ring entries. > Fuse meta headers and application payload goes into that buffer. > In both kernel/userspace directions. io-uring only allows 80B, so only a > really small request would fit into it. Well, the generic ringbuffer would lift that restriction. > Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse > header - intrinsically fixed in the ring patches. *nod* That's the big question, put the data inline (with potential alignment hassles) or manage (and map) a separate data structure. Maybe padding could be inserted to solve alignment? A separate data structure would only really be useful if it enabled zero copy, but that should probably be a secondary enhancement. > I will now try without mmap and just provide a user buffer as pointer in the 80B > section. > > > > > > We just looked up the device DMA requirements and with modern NVME only > > 4 byte alignment is required, but the block layer likely isn't set up to > > handle that. > > I think existing fuse headers have and their data have a 4 byte alignment. > Maybe even 8 byte, I don't remember without looking through all request types. > If you try a simple O_DIRECT read/write to libfuse/example_passthrough_hp > without the ring patches it will fail because of alignment. Needs to be fixed > in legacy fuse and would also avoid compat issues we had in libfuse when the > kernel header was updated. > > > > > So - prearranged buffer? Or are you using splice to get pages that > > userspace has read into into the kernel pagecache? > > I didn't even try to use splice yet, because for the DDN (my employer) use case > we cannot use zero copy, at least not without violating the rule that one > cannot access the application buffer in userspace. DDN - lustre related? > > I will definitely look into Mings work, as it will be useful for others. > > > Cheers, > Bernd
On 6/12/24 17:55, Kent Overstreet wrote: > On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote: >> On 6/12/24 16:19, Kent Overstreet wrote: >>> On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote: >>>> I will definitely look at it this week. Although I don't like the idea >>>> to have a new kthread. We already have an application thread and have >>>> the fuse server thread, why do we need another one? >>> >>> Ok, I hadn't found the fuse server thread - that should be fine. >>> >>>>> >>>>> The next thing I was going to look at is how you guys are using splice, >>>>> we want to get away from that too. >>>> >>>> Well, Ming Lei is working on that for ublk_drv and I guess that new approach >>>> could be adapted as well onto the current way of io-uring. >>>> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. >>>> >>>> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ >>>> >>>>> >>>>> Brian was also saying the fuse virtio_fs code may be worth >>>>> investigating, maybe that could be adapted? >>>> >>>> I need to check, but really, the majority of the new additions >>>> is just to set up things, shutdown and to have sanity checks. >>>> Request sending/completing to/from the ring is not that much new lines. >>> >>> What I'm wondering is how read/write requests are handled. Are the data >>> payloads going in the same ringbuffer as the commands? That could work, >>> if the ringbuffer is appropriately sized, but alignment is a an issue. >> >> That is exactly the big discussion Miklos and I have. Basically in my >> series another buffer is vmalloced, mmaped and then assigned to ring entries. >> Fuse meta headers and application payload goes into that buffer. >> In both kernel/userspace directions. io-uring only allows 80B, so only a >> really small request would fit into it. > > Well, the generic ringbuffer would lift that restriction. Yeah, kind of. Instead allocating the buffer in fuse, it would be now allocated in that code. At least all that setup code would be moved out of fuse. I will eventually come to your patches today. Now we only need to convince Miklos that your ring is better ;) > >> Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse >> header - intrinsically fixed in the ring patches. > > *nod* > > That's the big question, put the data inline (with potential alignment > hassles) or manage (and map) a separate data structure. > > Maybe padding could be inserted to solve alignment? Right now I have this struct: struct fuse_ring_req { union { /* The first 4K are command data */ char ring_header[FUSE_RING_HEADER_BUF_SIZE]; struct { uint64_t flags; /* enum fuse_ring_buf_cmd */ uint32_t in_out_arg_len; uint32_t padding; /* kernel fills in, reads out */ union { struct fuse_in_header in; struct fuse_out_header out; }; }; }; char in_out_arg[]; }; Data go into in_out_arg, i.e. headers are padded by the union. I actually wonder if FUSE_RING_HEADER_BUF_SIZE should be page size and not a fixed 4K. (I just see the stale comment 'enum fuse_ring_buf_cmd', will remove it in the next series) > > A separate data structure would only really be useful if it enabled zero > copy, but that should probably be a secondary enhancement. > >> I will now try without mmap and just provide a user buffer as pointer in the 80B >> section. >> >> >>> >>> We just looked up the device DMA requirements and with modern NVME only >>> 4 byte alignment is required, but the block layer likely isn't set up to >>> handle that. 
>> >> I think existing fuse headers have and their data have a 4 byte alignment. >> Maybe even 8 byte, I don't remember without looking through all request types. >> If you try a simple O_DIRECT read/write to libfuse/example_passthrough_hp >> without the ring patches it will fail because of alignment. Needs to be fixed >> in legacy fuse and would also avoid compat issues we had in libfuse when the >> kernel header was updated. >> >>> >>> So - prearranged buffer? Or are you using splice to get pages that >>> userspace has read into into the kernel pagecache? >> >> I didn't even try to use splice yet, because for the DDN (my employer) use case >> we cannot use zero copy, at least not without violating the rule that one >> cannot access the application buffer in userspace. > > DDN - lustre related? I have a bit of an ancient Lustre background, also with DDN, then went to Fraunhofer for FhGFS/BeeGFS (kind of competing with Lustre). Back at DDN initially on IME (burst buffer) and now Infinia. Lustre is mostly HPC only, Infinia is kind of everything.
On Wed, Jun 12, 2024 at 06:15:57PM GMT, Bernd Schubert wrote: > > > On 6/12/24 17:55, Kent Overstreet wrote: > > On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote: > > > On 6/12/24 16:19, Kent Overstreet wrote: > > > > On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote: > > > > > I will definitely look at it this week. Although I don't like the idea > > > > > to have a new kthread. We already have an application thread and have > > > > > the fuse server thread, why do we need another one? > > > > > > > > Ok, I hadn't found the fuse server thread - that should be fine. > > > > > > > > > > > > > > > > The next thing I was going to look at is how you guys are using splice, > > > > > > we want to get away from that too. > > > > > > > > > > Well, Ming Lei is working on that for ublk_drv and I guess that new approach > > > > > could be adapted as well onto the current way of io-uring. > > > > > It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. > > > > > > > > > > https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ > > > > > > > > > > > > > > > > > Brian was also saying the fuse virtio_fs code may be worth > > > > > > investigating, maybe that could be adapted? > > > > > > > > > > I need to check, but really, the majority of the new additions > > > > > is just to set up things, shutdown and to have sanity checks. > > > > > Request sending/completing to/from the ring is not that much new lines. > > > > > > > > What I'm wondering is how read/write requests are handled. Are the data > > > > payloads going in the same ringbuffer as the commands? That could work, > > > > if the ringbuffer is appropriately sized, but alignment is a an issue. > > > > > > That is exactly the big discussion Miklos and I have. Basically in my > > > series another buffer is vmalloced, mmaped and then assigned to ring entries. > > > Fuse meta headers and application payload goes into that buffer. > > > In both kernel/userspace directions. io-uring only allows 80B, so only a > > > really small request would fit into it. > > > > Well, the generic ringbuffer would lift that restriction. > > Yeah, kind of. Instead allocating the buffer in fuse, it would be now allocated > in that code. At least all that setup code would be moved out of fuse. I will > eventually come to your patches today. > Now we only need to convince Miklos that your ring is better ;) > > > > > > Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse > > > header - intrinsically fixed in the ring patches. > > > > *nod* > > > > That's the big question, put the data inline (with potential alignment > > hassles) or manage (and map) a separate data structure. > > > > Maybe padding could be inserted to solve alignment? > > Right now I have this struct: > > struct fuse_ring_req { > union { > /* The first 4K are command data */ > char ring_header[FUSE_RING_HEADER_BUF_SIZE]; > > struct { > uint64_t flags; > > /* enum fuse_ring_buf_cmd */ > uint32_t in_out_arg_len; > uint32_t padding; > > /* kernel fills in, reads out */ > union { > struct fuse_in_header in; > struct fuse_out_header out; > }; > }; > }; > > char in_out_arg[]; > }; > > > Data go into in_out_arg, i.e. headers are padded by the union. > I actually wonder if FUSE_RING_HEADER_BUF_SIZE should be page size > and not a fixed 4K. 
I would make the commands variable sized, so that commands with no data buffers don't need padding, and then when you do have a data command you only pad out that specific command so that the data buffer starts on a page boundary.
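A rough illustration of that suggestion (hypothetical layout, assuming 4K pages): header-only commands stay compact, and only commands that carry a data payload get padded so the payload starts on a page boundary.

#include <stdint.h>

#define RING_PAGE_SIZE 4096u    /* assumed page size */

/* size of a variable-sized command: header, optional padding, then payload */
static inline uint32_t ring_cmd_size(uint32_t hdr_size, uint32_t data_size)
{
    uint32_t padded_hdr;

    if (!data_size)
        return hdr_size;    /* no payload, no padding needed */

    /* round the header up so the payload lands on a page boundary */
    padded_hdr = (hdr_size + RING_PAGE_SIZE - 1) & ~(RING_PAGE_SIZE - 1);
    return padded_hdr + data_size;
}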
On 6/12/24 18:24, Kent Overstreet wrote: > On Wed, Jun 12, 2024 at 06:15:57PM GMT, Bernd Schubert wrote: >> >> >> On 6/12/24 17:55, Kent Overstreet wrote: >>> On Wed, Jun 12, 2024 at 03:40:14PM GMT, Bernd Schubert wrote: >>>> On 6/12/24 16:19, Kent Overstreet wrote: >>>>> On Wed, Jun 12, 2024 at 03:53:42PM GMT, Bernd Schubert wrote: >>>>>> I will definitely look at it this week. Although I don't like the idea >>>>>> to have a new kthread. We already have an application thread and have >>>>>> the fuse server thread, why do we need another one? >>>>> >>>>> Ok, I hadn't found the fuse server thread - that should be fine. >>>>> >>>>>>> >>>>>>> The next thing I was going to look at is how you guys are using splice, >>>>>>> we want to get away from that too. >>>>>> >>>>>> Well, Ming Lei is working on that for ublk_drv and I guess that new approach >>>>>> could be adapted as well onto the current way of io-uring. >>>>>> It _probably_ wouldn't work with IORING_OP_READV/IORING_OP_WRITEV. >>>>>> >>>>>> https://lore.gnuweeb.org/io-uring/20240511001214.173711-6-ming.lei@redhat.com/T/ >>>>>> >>>>>>> >>>>>>> Brian was also saying the fuse virtio_fs code may be worth >>>>>>> investigating, maybe that could be adapted? >>>>>> >>>>>> I need to check, but really, the majority of the new additions >>>>>> is just to set up things, shutdown and to have sanity checks. >>>>>> Request sending/completing to/from the ring is not that much new lines. >>>>> >>>>> What I'm wondering is how read/write requests are handled. Are the data >>>>> payloads going in the same ringbuffer as the commands? That could work, >>>>> if the ringbuffer is appropriately sized, but alignment is a an issue. >>>> >>>> That is exactly the big discussion Miklos and I have. Basically in my >>>> series another buffer is vmalloced, mmaped and then assigned to ring entries. >>>> Fuse meta headers and application payload goes into that buffer. >>>> In both kernel/userspace directions. io-uring only allows 80B, so only a >>>> really small request would fit into it. >>> >>> Well, the generic ringbuffer would lift that restriction. >> >> Yeah, kind of. Instead allocating the buffer in fuse, it would be now allocated >> in that code. At least all that setup code would be moved out of fuse. I will >> eventually come to your patches today. >> Now we only need to convince Miklos that your ring is better ;) >> >>> >>>> Legacy /dev/fuse has an alignment issue as payload follows directly as the fuse >>>> header - intrinsically fixed in the ring patches. >>> >>> *nod* >>> >>> That's the big question, put the data inline (with potential alignment >>> hassles) or manage (and map) a separate data structure. >>> >>> Maybe padding could be inserted to solve alignment? >> >> Right now I have this struct: >> >> struct fuse_ring_req { >> union { >> /* The first 4K are command data */ >> char ring_header[FUSE_RING_HEADER_BUF_SIZE]; >> >> struct { >> uint64_t flags; >> >> /* enum fuse_ring_buf_cmd */ >> uint32_t in_out_arg_len; >> uint32_t padding; >> >> /* kernel fills in, reads out */ >> union { >> struct fuse_in_header in; >> struct fuse_out_header out; >> }; >> }; >> }; >> >> char in_out_arg[]; >> }; >> >> >> Data go into in_out_arg, i.e. headers are padded by the union. >> I actually wonder if FUSE_RING_HEADER_BUF_SIZE should be page size >> and not a fixed 4K. 
> > I would make the commands variable sized, so that commands with no data > buffers don't need padding, and then when you do have a data command you > only pad out that specific command so that the data buffer starts on a > page boundary. The same buffer is used for kernel to userspace and the other way around - it is attached to the ring entry. Either direction will always have data, where would a dynamic sizing then be useful? Well, some "data" like the node id don't need to be aligned - we could save memory for that. I still would like to have some padding so that headers could be grown without any kind of compat issues. Though almost 4K is probably too much for that. Thanks for pointing it out, will improve it! Cheers, Bernd
On 6/12/24 16:56, Bernd Schubert wrote: > On 6/12/24 16:07, Miklos Szeredi wrote: >> On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@ddn.com> wrote: >> >>> I didn't do that yet, as we are going to use the ring buffer for requests, >>> i.e. the ring buffer immediately gets all the data from network, there is >>> no copy. Even if the ring buffer would get data from local disk - there >>> is no need to use a separate application buffer anymore. And with that >>> there is just no extra copy >> >> Let's just tackle this shared request buffer, as it seems to be a >> central part of your design. >> >> You say the shared buffer is used to immediately get the data from the >> network (or various other sources), which is completely viable. >> >> And then the kernel will do the copy from the shared buffer. Single copy, fine. >> >> But if the buffer wasn't shared? What would be the difference? >> Single copy also. >> >> Why is the shared buffer better? I mean it may even be worse due to >> cache aliasing issues on certain architectures. copy_to_user() / >> copy_from_user() are pretty darn efficient. > > Right now we have: > > - Application thread writes into the buffer, then calls io_uring_cmd_done > > I can try to do without mmap and set a pointer to the user buffer in the > 80B section of the SQE. I'm not sure if the application is allowed to > write into that buffer, possibly/probably we will be forced to use > io_uring_cmd_complete_in_task() in all cases (without 19/19 we have that > anyway). My greatest fear here is that the extra task has performance > implications for sync requests. > > >> >> Why is it better to have that buffer managed by kernel? Being locked >> in memory (being unswappable) is probably a disadvantage as well. And >> if locking is required, it can be done on the user buffer. > > Well, let me try to give the buffer in the 80B section. > >> >> And there are all the setup and teardown complexities... > > If the buffer in the 80B section works setup becomes easier, mmap and > ioctls go away. Teardown, well, we still need the workaround as we need > to handle io_uring_cmd_done, but if you could live with that for the > instance, I would ask Jens or Pavel or Ming for help if we could solve > that in io-uring itself. > Is the ring workaround in fuse_dev_release() acceptable for you? Or do > you have any another idea about it? > >> Short update, I have this working for some time now with a hack patch that just adds in a user buffer (without removing mmap, it is just unused). Initially I thought that is a lot slower, but after removing all the kernel debug options perf loss is just around 5% and I think I can get back the remaining by having iov_iter_get_pages2() of the user buffer in the initialization (with additional code overhead). I hope to have new patches by mid of next week. I also want to get rid of the difference of buffer layout between uring and /dev/fuse as that can be troublesome for other changes like alignment. That might require an io-uring CQE128, though. Thanks, Bernd
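For context, getting the 80-byte SQE command area (and, if replies or headers were ever inlined into completions, the large CQEs mentioned above) means the fuse server would create its ring with the big SQE/CQE setup flags; a minimal liburing sketch, with whether CQE128 is actually required still an open question above:

#include <liburing.h>

/* sketch: ring with 128-byte SQEs (80B command area) and 128-byte CQEs */
static int fuse_uring_setup(struct io_uring *ring, unsigned int entries)
{
    struct io_uring_params p = {
        .flags = IORING_SETUP_SQE128 | IORING_SETUP_CQE128,
    };

    return io_uring_queue_init_params(entries, ring, &p);
}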
[I shortened the CC list as that long came up only due to mmap and optimizations] On 6/12/24 16:56, Bernd Schubert wrote: > On 6/12/24 16:07, Miklos Szeredi wrote: >> On Wed, 12 Jun 2024 at 15:33, Bernd Schubert <bschubert@ddn.com> wrote: >> >>> I didn't do that yet, as we are going to use the ring buffer for requests, >>> i.e. the ring buffer immediately gets all the data from network, there is >>> no copy. Even if the ring buffer would get data from local disk - there >>> is no need to use a separate application buffer anymore. And with that >>> there is just no extra copy >> >> Let's just tackle this shared request buffer, as it seems to be a >> central part of your design. >> >> You say the shared buffer is used to immediately get the data from the >> network (or various other sources), which is completely viable. >> >> And then the kernel will do the copy from the shared buffer. Single copy, fine. >> >> But if the buffer wasn't shared? What would be the difference? >> Single copy also. >> >> Why is the shared buffer better? I mean it may even be worse due to >> cache aliasing issues on certain architectures. copy_to_user() / >> copy_from_user() are pretty darn efficient. > > Right now we have: > > - Application thread writes into the buffer, then calls io_uring_cmd_done > > I can try to do without mmap and set a pointer to the user buffer in the > 80B section of the SQE. I'm not sure if the application is allowed to > write into that buffer, possibly/probably we will be forced to use > io_uring_cmd_complete_in_task() in all cases (without 19/19 we have that > anyway). My greatest fear here is that the extra task has performance > implications for sync requests. > > >> >> Why is it better to have that buffer managed by kernel? Being locked >> in memory (being unswappable) is probably a disadvantage as well. And >> if locking is required, it can be done on the user buffer. > > Well, let me try to give the buffer in the 80B section. > >> >> And there are all the setup and teardown complexities... > > If the buffer in the 80B section works setup becomes easier, mmap and > ioctls go away. Teardown, well, we still need the workaround as we need > to handle io_uring_cmd_done, but if you could live with that for the > instance, I would ask Jens or Pavel or Ming for help if we could solve > that in io-uring itself. > Is the ring workaround in fuse_dev_release() acceptable for you? Or do > you have any another idea about it? > >> >> Note: the ring buffer used by io_uring is different. It literally >> allows communication without invoking any system calls in certain >> cases. That shared buffer doesn't add anything like that. At least I >> don't see what it actually adds. >> >> Hmm? > > The application can write into the buffer. We won't shared queue buffers > if we could solve the same with a user pointer. Wanted to send out a new series today, https://github.com/bsbernd/linux/tree/fuse-uring-for-6.10-rfc3-without-mmap but then just noticed a tear down issue. 
1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7] [ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G O 6.10.0+ #48 [ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 1525.922470] Workqueue: events io_fallback_req_func [ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80 [ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7 [ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002 [ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001 [ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0 [ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 [ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000 [ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001 [ 1525.968225] FS: 0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000 [ 1525.973932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0 [ 1525.980030] Call Trace: [ 1525.981371] <TASK> [ 1525.982567] ? __die_body+0x66/0xb0 [ 1525.984376] ? die_addr+0xc1/0x100 [ 1525.986111] ? exc_general_protection+0x1c6/0x330 [ 1525.988401] ? asm_exc_general_protection+0x22/0x30 [ 1525.990864] ? __lock_acquire+0x74/0x7b80 [ 1525.992901] ? mark_lock+0x9f/0x360 [ 1525.994635] ? __lock_acquire+0x1420/0x7b80 [ 1525.996629] ? attach_entity_load_avg+0x47d/0x550 [ 1525.998765] ? hlock_conflict+0x5a/0x1f0 [ 1526.000515] ? __bfs+0x2dc/0x5a0 [ 1526.001993] lock_acquire+0x1fb/0x3d0 [ 1526.004727] ? gup_fast_fallback+0x13f/0x1d80 [ 1526.006586] ? gup_fast_fallback+0x13f/0x1d80 [ 1526.008412] gup_fast_fallback+0x158/0x1d80 [ 1526.010170] ? gup_fast_fallback+0x13f/0x1d80 [ 1526.011999] ? __lock_acquire+0x2b07/0x7b80 [ 1526.013793] __iov_iter_get_pages_alloc+0x36e/0x980 [ 1526.015876] ? do_raw_spin_unlock+0x5a/0x8a0 [ 1526.017734] iov_iter_get_pages2+0x56/0x70 [ 1526.019491] fuse_copy_fill+0x48e/0x980 [fuse] [ 1526.021400] fuse_copy_args+0x174/0x6a0 [fuse] [ 1526.023199] fuse_uring_prepare_send+0x319/0x6c0 [fuse] [ 1526.025178] fuse_uring_send_req_in_task+0x42/0x100 [fuse] [ 1526.027163] io_fallback_req_func+0xb4/0x170 [ 1526.028737] ? process_scheduled_works+0x75b/0x1160 [ 1526.030445] process_scheduled_works+0x85c/0x1160 [ 1526.032073] worker_thread+0x8ba/0xce0 [ 1526.033388] kthread+0x23e/0x2b0 [ 1526.035404] ? pr_cont_work_flush+0x290/0x290 [ 1526.036958] ? kthread_blkcg+0xa0/0xa0 [ 1526.038321] ret_from_fork+0x30/0x60 [ 1526.039600] ? kthread_blkcg+0xa0/0xa0 [ 1526.040942] ret_from_fork_asm+0x11/0x20 [ 1526.042353] </TASK> We probably need to call iov_iter_get_pages2() immediately on submitting the buffer from fuse server and not only when needed. I had planned to do that as optimization later on, I think it is also needed to avoid io_uring_cmd_complete_in_task(). The part I don't like here is that with mmap we had a complex initialization - but then either it worked or did not. No exceptions at IO time. And run time was just a copy into the buffer. Without mmap initialization is much simpler, but now complexity shifts to IO time. Thanks, Bernd
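A rough kernel-side sketch of that idea - pinning the server buffer already at SQE submission time rather than from task_work at I/O time; the fuse_ring_ent_pages structure is an assumption, while import_ubuf() and iov_iter_get_pages2() are the existing iov_iter helpers (ITER_DEST assumes the kernel fills the buffer):

/* hypothetical per-entry page state, pages[] sized for the maximum buffer */
struct fuse_ring_ent_pages {
    struct page **pages;
    unsigned int nr_pages;
    size_t offset;      /* offset into the first page */
};

static int fuse_uring_pin_buf(struct fuse_ring_ent_pages *ep, void __user *ubuf,
                              size_t len, unsigned int max_pages)
{
    struct iov_iter iter;
    size_t off;
    ssize_t copied;
    unsigned int i;
    int err;

    err = import_ubuf(ITER_DEST, ubuf, len, &iter);
    if (err)
        return err;

    /* pins the pages and returns the number of bytes covered */
    copied = iov_iter_get_pages2(&iter, ep->pages, len, max_pages, &off);
    if (copied < 0)
        return copied;

    ep->offset = off;
    ep->nr_pages = DIV_ROUND_UP(off + copied, PAGE_SIZE);

    if (copied < len) {
        /* partial pin: drop the page references again and fail */
        for (i = 0; i < ep->nr_pages; i++)
            put_page(ep->pages[i]);
        return -EFAULT;
    }
    return 0;
}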
On 8/29/24 4:32 PM, Bernd Schubert wrote: > Wanted to send out a new series today, > > https://github.com/bsbernd/linux/tree/fuse-uring-for-6.10-rfc3-without-mmap > > but then just noticed a tear down issue. > > 1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7] > [ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G O 6.10.0+ #48 > [ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 > [ 1525.922470] Workqueue: events io_fallback_req_func > [ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80 > [ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7 > [ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002 > [ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001 > [ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0 > [ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 > [ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000 > [ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001 > [ 1525.968225] FS: 0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000 > [ 1525.973932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0 > [ 1525.980030] Call Trace: > [ 1525.981371] <TASK> > [ 1525.982567] ? __die_body+0x66/0xb0 > [ 1525.984376] ? die_addr+0xc1/0x100 > [ 1525.986111] ? exc_general_protection+0x1c6/0x330 > [ 1525.988401] ? asm_exc_general_protection+0x22/0x30 > [ 1525.990864] ? __lock_acquire+0x74/0x7b80 > [ 1525.992901] ? mark_lock+0x9f/0x360 > [ 1525.994635] ? __lock_acquire+0x1420/0x7b80 > [ 1525.996629] ? attach_entity_load_avg+0x47d/0x550 > [ 1525.998765] ? hlock_conflict+0x5a/0x1f0 > [ 1526.000515] ? __bfs+0x2dc/0x5a0 > [ 1526.001993] lock_acquire+0x1fb/0x3d0 > [ 1526.004727] ? gup_fast_fallback+0x13f/0x1d80 > [ 1526.006586] ? gup_fast_fallback+0x13f/0x1d80 > [ 1526.008412] gup_fast_fallback+0x158/0x1d80 > [ 1526.010170] ? gup_fast_fallback+0x13f/0x1d80 > [ 1526.011999] ? __lock_acquire+0x2b07/0x7b80 > [ 1526.013793] __iov_iter_get_pages_alloc+0x36e/0x980 > [ 1526.015876] ? do_raw_spin_unlock+0x5a/0x8a0 > [ 1526.017734] iov_iter_get_pages2+0x56/0x70 > [ 1526.019491] fuse_copy_fill+0x48e/0x980 [fuse] > [ 1526.021400] fuse_copy_args+0x174/0x6a0 [fuse] > [ 1526.023199] fuse_uring_prepare_send+0x319/0x6c0 [fuse] > [ 1526.025178] fuse_uring_send_req_in_task+0x42/0x100 [fuse] > [ 1526.027163] io_fallback_req_func+0xb4/0x170 > [ 1526.028737] ? process_scheduled_works+0x75b/0x1160 > [ 1526.030445] process_scheduled_works+0x85c/0x1160 > [ 1526.032073] worker_thread+0x8ba/0xce0 > [ 1526.033388] kthread+0x23e/0x2b0 > [ 1526.035404] ? pr_cont_work_flush+0x290/0x290 > [ 1526.036958] ? kthread_blkcg+0xa0/0xa0 > [ 1526.038321] ret_from_fork+0x30/0x60 > [ 1526.039600] ? kthread_blkcg+0xa0/0xa0 > [ 1526.040942] ret_from_fork_asm+0x11/0x20 > [ 1526.042353] </TASK> > > > We probably need to call iov_iter_get_pages2() immediately > on submitting the buffer from fuse server and not only when needed. > I had planned to do that as optimization later on, I think > it is also needed to avoid io_uring_cmd_complete_in_task(). 
I think you do, but it's not really what's wrong here - fallback work is being invoked as the ring is being torn down, either directly or because the task is exiting. Your task_work should check if this is the case, and just do -ECANCELED for this case rather than attempt to execute the work. Most task_work doesn't do much outside of post a completion, but yours seems complex in that attempts to map pages as well, for example. In any case, regardless of whether you move the gup to the actual issue side of things (which I think you should), then you'd want something ala: if (req->task != current) don't issue, -ECANCELED in your task_work. > The part I don't like here is that with mmap we had a complex > initialization - but then either it worked or did not. No exceptions > at IO time. And run time was just a copy into the buffer. > Without mmap initialization is much simpler, but now complexity shifts > to IO time. I'll take a look at your code. But I'd say just fix the missing check above and send out what you have, it's much easier to iterate on the list rather than poking at patches in some git branch somewhere.
On 8/30/24 15:12, Jens Axboe wrote: > On 8/29/24 4:32 PM, Bernd Schubert wrote: >> Wanted to send out a new series today, >> >> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.10-rfc3-without-mmap >> >> but then just noticed a tear down issue. >> >> 1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7] >> [ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G O 6.10.0+ #48 >> [ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 >> [ 1525.922470] Workqueue: events io_fallback_req_func >> [ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80 >> [ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7 >> [ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002 >> [ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001 >> [ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0 >> [ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 >> [ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000 >> [ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001 >> [ 1525.968225] FS: 0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000 >> [ 1525.973932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0 >> [ 1525.980030] Call Trace: >> [ 1525.981371] <TASK> >> [ 1525.982567] ? __die_body+0x66/0xb0 >> [ 1525.984376] ? die_addr+0xc1/0x100 >> [ 1525.986111] ? exc_general_protection+0x1c6/0x330 >> [ 1525.988401] ? asm_exc_general_protection+0x22/0x30 >> [ 1525.990864] ? __lock_acquire+0x74/0x7b80 >> [ 1525.992901] ? mark_lock+0x9f/0x360 >> [ 1525.994635] ? __lock_acquire+0x1420/0x7b80 >> [ 1525.996629] ? attach_entity_load_avg+0x47d/0x550 >> [ 1525.998765] ? hlock_conflict+0x5a/0x1f0 >> [ 1526.000515] ? __bfs+0x2dc/0x5a0 >> [ 1526.001993] lock_acquire+0x1fb/0x3d0 >> [ 1526.004727] ? gup_fast_fallback+0x13f/0x1d80 >> [ 1526.006586] ? gup_fast_fallback+0x13f/0x1d80 >> [ 1526.008412] gup_fast_fallback+0x158/0x1d80 >> [ 1526.010170] ? gup_fast_fallback+0x13f/0x1d80 >> [ 1526.011999] ? __lock_acquire+0x2b07/0x7b80 >> [ 1526.013793] __iov_iter_get_pages_alloc+0x36e/0x980 >> [ 1526.015876] ? do_raw_spin_unlock+0x5a/0x8a0 >> [ 1526.017734] iov_iter_get_pages2+0x56/0x70 >> [ 1526.019491] fuse_copy_fill+0x48e/0x980 [fuse] >> [ 1526.021400] fuse_copy_args+0x174/0x6a0 [fuse] >> [ 1526.023199] fuse_uring_prepare_send+0x319/0x6c0 [fuse] >> [ 1526.025178] fuse_uring_send_req_in_task+0x42/0x100 [fuse] >> [ 1526.027163] io_fallback_req_func+0xb4/0x170 >> [ 1526.028737] ? process_scheduled_works+0x75b/0x1160 >> [ 1526.030445] process_scheduled_works+0x85c/0x1160 >> [ 1526.032073] worker_thread+0x8ba/0xce0 >> [ 1526.033388] kthread+0x23e/0x2b0 >> [ 1526.035404] ? pr_cont_work_flush+0x290/0x290 >> [ 1526.036958] ? kthread_blkcg+0xa0/0xa0 >> [ 1526.038321] ret_from_fork+0x30/0x60 >> [ 1526.039600] ? kthread_blkcg+0xa0/0xa0 >> [ 1526.040942] ret_from_fork_asm+0x11/0x20 >> [ 1526.042353] </TASK> >> >> >> We probably need to call iov_iter_get_pages2() immediately >> on submitting the buffer from fuse server and not only when needed. >> I had planned to do that as optimization later on, I think >> it is also needed to avoid io_uring_cmd_complete_in_task(). 
> > I think you do, but it's not really what's wrong here - fallback work is > being invoked as the ring is being torn down, either directly or because > the task is exiting. Your task_work should check if this is the case, > and just do -ECANCELED for this case rather than attempt to execute the > work. Most task_work doesn't do much outside of post a completion, but > yours seems complex in that attempts to map pages as well, for example. > In any case, regardless of whether you move the gup to the actual issue > side of things (which I think you should), then you'd want something > ala: > > if (req->task != current) > don't issue, -ECANCELED > > in your task_work. Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong into __io_uring_cmd_do_in_task then? Because my task_work_cb function (passed to io_uring_cmd_complete_in_task) doesn't even have the request. I'm going to test this in a bit diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c index 21ac5fb2d5f0..c06b9fcff48f 100644 --- a/io_uring/uring_cmd.c +++ b/io_uring/uring_cmd.c @@ -120,6 +120,11 @@ static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts) { struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd); + if (req->task != current) { + /* don't issue, -ECANCELED */ + return; + } + /* task_work executor checks the deffered list completion */ ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER); } > >> The part I don't like here is that with mmap we had a complex >> initialization - but then either it worked or did not. No exceptions >> at IO time. And run time was just a copy into the buffer. >> Without mmap initialization is much simpler, but now complexity shifts >> to IO time. > > I'll take a look at your code. But I'd say just fix the missing check > above and send out what you have, it's much easier to iterate on the > list rather than poking at patches in some git branch somewhere. > I'm almost through updating it, will send something out definitely today. I will just keep the last patch that pins user buffer pages on top of the series - will avoid all the rebasing. Thanks, Bernd
On 8/30/24 7:28 AM, Bernd Schubert wrote: > On 8/30/24 15:12, Jens Axboe wrote: >> On 8/29/24 4:32 PM, Bernd Schubert wrote: >>> Wanted to send out a new series today, >>> >>> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.10-rfc3-without-mmap >>> >>> but then just noticed a tear down issue. >>> >>> 1525.905504] KASAN: null-ptr-deref in range [0x00000000000001a0-0x00000000000001a7] >>> [ 1525.910431] CPU: 15 PID: 183 Comm: kworker/15:1 Tainted: G O 6.10.0+ #48 >>> [ 1525.916449] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 >>> [ 1525.922470] Workqueue: events io_fallback_req_func >>> [ 1525.925840] RIP: 0010:__lock_acquire+0x74/0x7b80 >>> [ 1525.929010] Code: 89 bc 24 80 00 00 00 0f 85 1c 5f 00 00 83 3d 6e 80 b0 02 00 0f 84 1d 12 00 00 83 3d 65 c7 67 02 00 74 27 48 89 f8 48 c1 e8 03 <42> 80 3c 30 00 74 0d e8 50 44 42 00 48 8b bc 24 80 00 00 00 48 c7 >>> [ 1525.942211] RSP: 0018:ffff88810b2af490 EFLAGS: 00010002 >>> [ 1525.945672] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000001 >>> [ 1525.950421] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000000001a0 >>> [ 1525.955200] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000 >>> [ 1525.959979] R10: dffffc0000000000 R11: fffffbfff07b1cbe R12: 0000000000000000 >>> [ 1525.964252] R13: 0000000000000001 R14: dffffc0000000000 R15: 0000000000000001 >>> [ 1525.968225] FS: 0000000000000000(0000) GS:ffff88875b200000(0000) knlGS:0000000000000000 >>> [ 1525.973932] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> [ 1525.976694] CR2: 00005555b6a381f0 CR3: 000000012f5f1000 CR4: 00000000000006f0 >>> [ 1525.980030] Call Trace: >>> [ 1525.981371] <TASK> >>> [ 1525.982567] ? __die_body+0x66/0xb0 >>> [ 1525.984376] ? die_addr+0xc1/0x100 >>> [ 1525.986111] ? exc_general_protection+0x1c6/0x330 >>> [ 1525.988401] ? asm_exc_general_protection+0x22/0x30 >>> [ 1525.990864] ? __lock_acquire+0x74/0x7b80 >>> [ 1525.992901] ? mark_lock+0x9f/0x360 >>> [ 1525.994635] ? __lock_acquire+0x1420/0x7b80 >>> [ 1525.996629] ? attach_entity_load_avg+0x47d/0x550 >>> [ 1525.998765] ? hlock_conflict+0x5a/0x1f0 >>> [ 1526.000515] ? __bfs+0x2dc/0x5a0 >>> [ 1526.001993] lock_acquire+0x1fb/0x3d0 >>> [ 1526.004727] ? gup_fast_fallback+0x13f/0x1d80 >>> [ 1526.006586] ? gup_fast_fallback+0x13f/0x1d80 >>> [ 1526.008412] gup_fast_fallback+0x158/0x1d80 >>> [ 1526.010170] ? gup_fast_fallback+0x13f/0x1d80 >>> [ 1526.011999] ? __lock_acquire+0x2b07/0x7b80 >>> [ 1526.013793] __iov_iter_get_pages_alloc+0x36e/0x980 >>> [ 1526.015876] ? do_raw_spin_unlock+0x5a/0x8a0 >>> [ 1526.017734] iov_iter_get_pages2+0x56/0x70 >>> [ 1526.019491] fuse_copy_fill+0x48e/0x980 [fuse] >>> [ 1526.021400] fuse_copy_args+0x174/0x6a0 [fuse] >>> [ 1526.023199] fuse_uring_prepare_send+0x319/0x6c0 [fuse] >>> [ 1526.025178] fuse_uring_send_req_in_task+0x42/0x100 [fuse] >>> [ 1526.027163] io_fallback_req_func+0xb4/0x170 >>> [ 1526.028737] ? process_scheduled_works+0x75b/0x1160 >>> [ 1526.030445] process_scheduled_works+0x85c/0x1160 >>> [ 1526.032073] worker_thread+0x8ba/0xce0 >>> [ 1526.033388] kthread+0x23e/0x2b0 >>> [ 1526.035404] ? pr_cont_work_flush+0x290/0x290 >>> [ 1526.036958] ? kthread_blkcg+0xa0/0xa0 >>> [ 1526.038321] ret_from_fork+0x30/0x60 >>> [ 1526.039600] ? kthread_blkcg+0xa0/0xa0 >>> [ 1526.040942] ret_from_fork_asm+0x11/0x20 >>> [ 1526.042353] </TASK> >>> >>> >>> We probably need to call iov_iter_get_pages2() immediately >>> on submitting the buffer from fuse server and not only when needed. 
>>> I had planned to do that as optimization later on, I think >>> it is also needed to avoid io_uring_cmd_complete_in_task(). >> >> I think you do, but it's not really what's wrong here - fallback work is >> being invoked as the ring is being torn down, either directly or because >> the task is exiting. Your task_work should check if this is the case, >> and just do -ECANCELED for this case rather than attempt to execute the >> work. Most task_work doesn't do much outside of post a completion, but >> yours seems complex in that attempts to map pages as well, for example. >> In any case, regardless of whether you move the gup to the actual issue >> side of things (which I think you should), then you'd want something >> ala: >> >> if (req->task != current) >> don't issue, -ECANCELED >> >> in your task_work. > > Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong > into __io_uring_cmd_do_in_task then? Because my task_work_cb function > (passed to io_uring_cmd_complete_in_task) doesn't even have the request. Yeah it probably does, the uring_cmd case is a bit special is that it's a set of helpers around task_work that can be consumed by eg fuse and ublk. The existing users don't really do anything complicated on that side, hence there's no real need to check. But since the ring/task is going away, we should be able to generically do it in the helpers like you did below. > I'm going to test this in a bit > > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > index 21ac5fb2d5f0..c06b9fcff48f 100644 > --- a/io_uring/uring_cmd.c > +++ b/io_uring/uring_cmd.c > @@ -120,6 +120,11 @@ static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts) > { > struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd); > > + if (req->task != current) { > + /* don't issue, -ECANCELED */ > + return; > + } > + > /* task_work executor checks the deffered list completion */ > ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER); > } > > > >> >>> The part I don't like here is that with mmap we had a complex >>> initialization - but then either it worked or did not. No exceptions >>> at IO time. And run time was just a copy into the buffer. >>> Without mmap initialization is much simpler, but now complexity shifts >>> to IO time. >> >> I'll take a look at your code. But I'd say just fix the missing check >> above and send out what you have, it's much easier to iterate on the >> list rather than poking at patches in some git branch somewhere. >> > > I'm almost through updating it, will send something out definitely > today. I will just keep the last patch that pins user buffer pages on > top of the series - will avoid all the rebasing. Excellent! Would be great if you can include how to test it as well, as then I can give it a spin as well and test any potential code changes proposed.
On 8/30/24 14:33, Jens Axboe wrote: > On 8/30/24 7:28 AM, Bernd Schubert wrote: >> On 8/30/24 15:12, Jens Axboe wrote: >>> On 8/29/24 4:32 PM, Bernd Schubert wrote: >>>> We probably need to call iov_iter_get_pages2() immediately >>>> on submitting the buffer from fuse server and not only when needed. >>>> I had planned to do that as optimization later on, I think >>>> it is also needed to avoid io_uring_cmd_complete_in_task(). >>> >>> I think you do, but it's not really what's wrong here - fallback work is >>> being invoked as the ring is being torn down, either directly or because >>> the task is exiting. Your task_work should check if this is the case, >>> and just do -ECANCELED for this case rather than attempt to execute the >>> work. Most task_work doesn't do much outside of post a completion, but >>> yours seems complex in that attempts to map pages as well, for example. >>> In any case, regardless of whether you move the gup to the actual issue >>> side of things (which I think you should), then you'd want something >>> ala: >>> >>> if (req->task != current) >>> don't issue, -ECANCELED >>> >>> in your task_work.nvme_uring_task_cb >> >> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong >> into __io_uring_cmd_do_in_task then? Because my task_work_cb function >> (passed to io_uring_cmd_complete_in_task) doesn't even have the request. > > Yeah it probably does, the uring_cmd case is a bit special is that it's > a set of helpers around task_work that can be consumed by eg fuse and > ublk. The existing users don't really do anything complicated on that > side, hence there's no real need to check. But since the ring/task is > going away, we should be able to generically do it in the helpers like > you did below. That won't work, we should give commands an opportunity to clean up after themselves. I'm pretty sure it will break existing users. For now we can pass a flag to the callback, fuse would need to check it and fail. Compile tested only commit a5b382f150b44476ccfa84cefdb22ce2ceeb12f1 Author: Pavel Begunkov <asml.silence@gmail.com> Date: Fri Aug 30 15:43:32 2024 +0100 io_uring/cmd: let cmds tw know about dying task When the taks that submitted a request is dying, a task work for that request might get run by a kernel thread or even worse by a half dismantled task. We can't just cancel the task work without running the callback as the cmd might need to do some clean up, so pass a flag instead. If set, it's not safe to access any task resources and the callback is expected to cancel the cmd ASAP. 
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index ace7ac056d51..a89abec98832 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -37,6 +37,7 @@ enum io_uring_cmd_flags {
 	/* set when uring wants to cancel a previously issued command */
 	IO_URING_F_CANCEL		= (1 << 11),
 	IO_URING_F_COMPAT		= (1 << 12),
+	IO_URING_F_TASK_DEAD		= (1 << 13),
 };
 
 struct io_zcrx_ifq;
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 8391c7c7c1ec..55bdcb4b63b3 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -119,9 +119,13 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_mark_cancelable);
 static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state *ts)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	unsigned flags = IO_URING_F_COMPLETE_DEFER;
+
+	if (req->task->flags & PF_EXITING)
+		flags |= IO_URING_F_TASK_DEAD;
 
 	/* task_work executor checks the deffered list completion */
-	ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER);
+	ioucmd->task_work_cb(ioucmd, flags);
 }
 
 void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd,
On 8/30/24 16:55, Pavel Begunkov wrote: > On 8/30/24 14:33, Jens Axboe wrote: >> On 8/30/24 7:28 AM, Bernd Schubert wrote: >>> On 8/30/24 15:12, Jens Axboe wrote: >>>> On 8/29/24 4:32 PM, Bernd Schubert wrote: >>>>> We probably need to call iov_iter_get_pages2() immediately >>>>> on submitting the buffer from fuse server and not only when needed. >>>>> I had planned to do that as optimization later on, I think >>>>> it is also needed to avoid io_uring_cmd_complete_in_task(). >>>> >>>> I think you do, but it's not really what's wrong here - fallback >>>> work is >>>> being invoked as the ring is being torn down, either directly or >>>> because >>>> the task is exiting. Your task_work should check if this is the case, >>>> and just do -ECANCELED for this case rather than attempt to execute the >>>> work. Most task_work doesn't do much outside of post a completion, but >>>> yours seems complex in that attempts to map pages as well, for example. >>>> In any case, regardless of whether you move the gup to the actual issue >>>> side of things (which I think you should), then you'd want something >>>> ala: >>>> >>>> if (req->task != current) >>>> don't issue, -ECANCELED >>>> >>>> in your task_work.nvme_uring_task_cb >>> >>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong >>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function >>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request. >> >> Yeah it probably does, the uring_cmd case is a bit special is that it's >> a set of helpers around task_work that can be consumed by eg fuse and >> ublk. The existing users don't really do anything complicated on that >> side, hence there's no real need to check. But since the ring/task is >> going away, we should be able to generically do it in the helpers like >> you did below. > > That won't work, we should give commands an opportunity to clean up > after themselves. I'm pretty sure it will break existing users. > For now we can pass a flag to the callback, fuse would need to > check it and fail. Compile tested only > > commit a5b382f150b44476ccfa84cefdb22ce2ceeb12f1 > Author: Pavel Begunkov <asml.silence@gmail.com> > Date: Fri Aug 30 15:43:32 2024 +0100 > > io_uring/cmd: let cmds tw know about dying task > When the taks that submitted a request is dying, a task work for > that > request might get run by a kernel thread or even worse by a half > dismantled task. We can't just cancel the task work without running the > callback as the cmd might need to do some clean up, so pass a flag > instead. If set, it's not safe to access any task resources and the > callback is expected to cancel the cmd ASAP. 
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> > > diff --git a/include/linux/io_uring_types.h > b/include/linux/io_uring_types.h > index ace7ac056d51..a89abec98832 100644 > --- a/include/linux/io_uring_types.h > +++ b/include/linux/io_uring_types.h > @@ -37,6 +37,7 @@ enum io_uring_cmd_flags { > /* set when uring wants to cancel a previously issued command */ > IO_URING_F_CANCEL = (1 << 11), > IO_URING_F_COMPAT = (1 << 12), > + IO_URING_F_TASK_DEAD = (1 << 13), > }; > > struct io_zcrx_ifq; > diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c > index 8391c7c7c1ec..55bdcb4b63b3 100644 > --- a/io_uring/uring_cmd.c > +++ b/io_uring/uring_cmd.c > @@ -119,9 +119,13 @@ EXPORT_SYMBOL_GPL(io_uring_cmd_mark_cancelable); > static void io_uring_cmd_work(struct io_kiocb *req, struct io_tw_state > *ts) > { > struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct > io_uring_cmd); > + unsigned flags = IO_URING_F_COMPLETE_DEFER; > + > + if (req->task->flags & PF_EXITING) > + flags |= IO_URING_F_TASK_DEAD; > > /* task_work executor checks the deffered list completion */ > - ioucmd->task_work_cb(ioucmd, IO_URING_F_COMPLETE_DEFER); > + ioucmd->task_work_cb(ioucmd, flags); > } > > void __io_uring_cmd_do_in_task(struct io_uring_cmd *ioucmd, > Thanks and yeah you are right, the previous patch would have missed an io_uring_cmd_done for fuse-uring as well. Thanks, Bernd
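To make that point concrete, a hedged sketch of the pattern Pavel's flag
enables, extending the illustrative my_driver_ example from earlier in
the thread (this is not the fuse patch itself, which follows below):
with IO_URING_F_TASK_DEAD the consumer's callback still runs and can
complete the command with io_uring_cmd_done(), whereas an early return
in the generic io_uring_cmd_work() helper would have skipped the
callback and left the command uncompleted.

static void my_driver_task_work_cb(struct io_uring_cmd *cmd,
				   unsigned int issue_flags)
{
	if (issue_flags & IO_URING_F_TASK_DEAD) {
		/* submitter is exiting: don't touch its mm or other resources */
		io_uring_cmd_done(cmd, -ECANCELED, 0, issue_flags);
		return;
	}

	/* normal processing path */
}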
On 8/30/24 8:55 AM, Pavel Begunkov wrote: > On 8/30/24 14:33, Jens Axboe wrote: >> On 8/30/24 7:28 AM, Bernd Schubert wrote: >>> On 8/30/24 15:12, Jens Axboe wrote: >>>> On 8/29/24 4:32 PM, Bernd Schubert wrote: >>>>> We probably need to call iov_iter_get_pages2() immediately >>>>> on submitting the buffer from fuse server and not only when needed. >>>>> I had planned to do that as optimization later on, I think >>>>> it is also needed to avoid io_uring_cmd_complete_in_task(). >>>> >>>> I think you do, but it's not really what's wrong here - fallback work is >>>> being invoked as the ring is being torn down, either directly or because >>>> the task is exiting. Your task_work should check if this is the case, >>>> and just do -ECANCELED for this case rather than attempt to execute the >>>> work. Most task_work doesn't do much outside of post a completion, but >>>> yours seems complex in that attempts to map pages as well, for example. >>>> In any case, regardless of whether you move the gup to the actual issue >>>> side of things (which I think you should), then you'd want something >>>> ala: >>>> >>>> if (req->task != current) >>>> don't issue, -ECANCELED >>>> >>>> in your task_work.nvme_uring_task_cb >>> >>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong >>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function >>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request. >> >> Yeah it probably does, the uring_cmd case is a bit special is that it's >> a set of helpers around task_work that can be consumed by eg fuse and >> ublk. The existing users don't really do anything complicated on that >> side, hence there's no real need to check. But since the ring/task is >> going away, we should be able to generically do it in the helpers like >> you did below. > > That won't work, we should give commands an opportunity to clean up > after themselves. I'm pretty sure it will break existing users. > For now we can pass a flag to the callback, fuse would need to > check it and fail. Compile tested only Right, I did actually consider that yesterday and why I replied with the fuse callback needing to do it, but then forgot... Since we can't do a generic cleanup callback, it'll have to be done in the handler. I do like making this generic and not needing individual task_work handlers like this checking for some magic, so I like the flag addition.
On 8/30/24 22:08, Jens Axboe wrote:
> On 8/30/24 8:55 AM, Pavel Begunkov wrote:
>> On 8/30/24 14:33, Jens Axboe wrote:
>>> On 8/30/24 7:28 AM, Bernd Schubert wrote:
>>>> On 8/30/24 15:12, Jens Axboe wrote:
>>>>> On 8/29/24 4:32 PM, Bernd Schubert wrote:
>>>>>> We probably need to call iov_iter_get_pages2() immediately
>>>>>> on submitting the buffer from fuse server and not only when needed.
>>>>>> I had planned to do that as optimization later on, I think
>>>>>> it is also needed to avoid io_uring_cmd_complete_in_task().
>>>>>
>>>>> I think you do, but it's not really what's wrong here - fallback work is
>>>>> being invoked as the ring is being torn down, either directly or because
>>>>> the task is exiting. Your task_work should check if this is the case,
>>>>> and just do -ECANCELED for this case rather than attempt to execute the
>>>>> work. Most task_work doesn't do much outside of post a completion, but
>>>>> yours seems complex in that attempts to map pages as well, for example.
>>>>> In any case, regardless of whether you move the gup to the actual issue
>>>>> side of things (which I think you should), then you'd want something
>>>>> ala:
>>>>>
>>>>> if (req->task != current)
>>>>> 	don't issue, -ECANCELED
>>>>>
>>>>> in your task_work.nvme_uring_task_cb
>>>>
>>>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong
>>>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function
>>>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request.
>>>
>>> Yeah it probably does, the uring_cmd case is a bit special is that it's
>>> a set of helpers around task_work that can be consumed by eg fuse and
>>> ublk. The existing users don't really do anything complicated on that
>>> side, hence there's no real need to check. But since the ring/task is
>>> going away, we should be able to generically do it in the helpers like
>>> you did below.
>>
>> That won't work, we should give commands an opportunity to clean up
>> after themselves. I'm pretty sure it will break existing users.
>> For now we can pass a flag to the callback, fuse would need to
>> check it and fail. Compile tested only
>
> Right, I did actually consider that yesterday and why I replied with the
> fuse callback needing to do it, but then forgot... Since we can't do a
> generic cleanup callback, it'll have to be done in the handler.
>
> I do like making this generic and not needing individual task_work
> handlers like this checking for some magic, so I like the flag addition.
>

Found another issue (in error handling in my code) while working on page
pinning of the user buffer and fixed that first. Way too late now (or
early) to continue with the page pinning, but I gave Pavel's patch a try
with the additional patch below - same issue.
I added a warn message to see if it triggers - it doesn't come up:

	if (unlikely(issue_flags & IO_URING_F_TASK_DEAD)) {
		pr_warn("IO_URING_F_TASK_DEAD");
		goto terminating;
	}

I could dig further, but I'm actually not sure if we need to. With early
page pinning the entire function should go away, as I hope that the
application can write into the buffer again. Although I'm not sure yet if
Miklos will like that pinning.
bschubert2@imesrv6 linux.git> stg show handle-IO_URING_F_TASK_DEAD

commit 42b4dae795bd37918455bad0ce3eea64b28be03c (HEAD -> fuse-uring-for-6.10-rfc3-without-mmap)
Author: Bernd Schubert <bschubert@ddn.com>
Date:   Sat Aug 31 01:26:26 2024 +0200

    fuse: {uring} Handle IO_URING_F_TASK_DEAD

    The ring task is terminating, it is not safe to still access its
    resources. Also no need for further actions.

diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index e557f595133b..1d5dfa9c0965 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1003,6 +1003,9 @@ fuse_uring_send_req_in_task(struct io_uring_cmd *cmd,
 	BUILD_BUG_ON(sizeof(pdu) > sizeof(cmd->pdu));
 	int err;
 
+	if (unlikely(issue_flags & IO_URING_F_TASK_DEAD))
+		goto terminating;
+
 	err = fuse_uring_prepare_send(ring_ent);
 	if (err)
 		goto err;
@@ -1017,6 +1020,10 @@ fuse_uring_send_req_in_task(struct io_uring_cmd *cmd,
 	return;
 err:
 	fuse_uring_next_fuse_req(ring_ent, queue);
+
+terminating:
+	/* Avoid all actions as the task that issues the ring is terminating */
+	io_uring_cmd_done(cmd, -ECANCELED, 0, issue_flags);
 }
 
 /* queue a fuse request and send it if a ring entry is available */

Thanks,
Bernd
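The early pinning Bernd refers to could, for illustration, look roughly
like the sketch below. This is not the code from the series: the
function name, its parameters and the use of import_ubuf() plus
iov_iter_get_pages2() are assumptions made for the example, and whether
plain page references or FOLL_PIN-style pinning is used is exactly the
kind of detail still open in the discussion. The idea is to take the
page references once, when the server registers the buffer with its
SQE, so the task_work path never has to run gup_fast_fallback() at IO
time and the teardown case has nothing user-visible left to do.

#include <linux/uio.h>
#include <linux/mm.h>

/*
 * Sketch only: pin/reference the fuse server's request buffer at
 * registration time. Releasing the page references on error or on
 * ring teardown is omitted here.
 */
static int fuse_uring_pin_user_buf(void __user *ubuf, size_t len,
				   struct page **pages, unsigned int max_pages,
				   size_t *offset)
{
	struct iov_iter iter;
	ssize_t pinned;
	int err;

	/* ITER_DEST: the kernel will later copy request data into this buffer */
	err = import_ubuf(ITER_DEST, ubuf, len, &iter);
	if (err)
		return err;

	pinned = iov_iter_get_pages2(&iter, pages, len, max_pages, offset);
	if (pinned < 0)
		return pinned;
	if ((size_t)pinned < len)
		return -EFAULT;	/* sketch: require the whole buffer to be covered */

	return 0;
}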
On 8/31/24 02:02, Bernd Schubert wrote: > > > On 8/30/24 22:08, Jens Axboe wrote: >> On 8/30/24 8:55 AM, Pavel Begunkov wrote: >>> On 8/30/24 14:33, Jens Axboe wrote: >>>> On 8/30/24 7:28 AM, Bernd Schubert wrote: >>>>> On 8/30/24 15:12, Jens Axboe wrote: >>>>>> On 8/29/24 4:32 PM, Bernd Schubert wrote: >>>>>>> We probably need to call iov_iter_get_pages2() immediately >>>>>>> on submitting the buffer from fuse server and not only when needed. >>>>>>> I had planned to do that as optimization later on, I think >>>>>>> it is also needed to avoid io_uring_cmd_complete_in_task(). >>>>>> >>>>>> I think you do, but it's not really what's wrong here - fallback work is >>>>>> being invoked as the ring is being torn down, either directly or because >>>>>> the task is exiting. Your task_work should check if this is the case, >>>>>> and just do -ECANCELED for this case rather than attempt to execute the >>>>>> work. Most task_work doesn't do much outside of post a completion, but >>>>>> yours seems complex in that attempts to map pages as well, for example. >>>>>> In any case, regardless of whether you move the gup to the actual issue >>>>>> side of things (which I think you should), then you'd want something >>>>>> ala: >>>>>> >>>>>> if (req->task != current) >>>>>> don't issue, -ECANCELED >>>>>> >>>>>> in your task_work.nvme_uring_task_cb >>>>> >>>>> Thanks a lot for your help Jens! I'm a bit confused, doesn't this belong >>>>> into __io_uring_cmd_do_in_task then? Because my task_work_cb function >>>>> (passed to io_uring_cmd_complete_in_task) doesn't even have the request. >>>> >>>> Yeah it probably does, the uring_cmd case is a bit special is that it's >>>> a set of helpers around task_work that can be consumed by eg fuse and >>>> ublk. The existing users don't really do anything complicated on that >>>> side, hence there's no real need to check. But since the ring/task is >>>> going away, we should be able to generically do it in the helpers like >>>> you did below. >>> >>> That won't work, we should give commands an opportunity to clean up >>> after themselves. I'm pretty sure it will break existing users. >>> For now we can pass a flag to the callback, fuse would need to >>> check it and fail. Compile tested only >> >> Right, I did actually consider that yesterday and why I replied with the >> fuse callback needing to do it, but then forgot... Since we can't do a >> generic cleanup callback, it'll have to be done in the handler. >> >> I do like making this generic and not needing individual task_work >> handlers like this checking for some magic, so I like the flag addition. >> > > Found another issue in (error handling in my code) while working on page > pinning of the user buffer and fixed that first. Ways to late now (or early) > to continue with the page pinning, but I gave Pavels patch a try with the > additional patch below - same issue. > I added a warn message to see if triggers - doesn't come up > > if (unlikely(issue_flags & IO_URING_F_TASK_DEAD)) { > pr_warn("IO_URING_F_TASK_DEAD"); > goto terminating; > } > > > I could digg further, but I'm actually not sure if we need to. With early page pinning > the entire function should go away, as I hope that the application can write into the > buffer again. Although I'm not sure yet if Miklos will like that pinning. Works with page pinning, new series comes once I got some sleep (still need to write the change log). Thanks, Bernd