Message ID | 20250128133927.3989681-1-max.kellermann@ionos.com (mailing list archive)
---|---
Series | Various io_uring micro-optimizations (reducing lock contention)
On 1/28/25 6:39 AM, Max Kellermann wrote:
> While optimizing my io_uring-based web server, I found that the kernel spends 35% of the CPU time waiting for `io_wq_acct.lock`. This patch set reduces contention of this lock, though I believe much more should be done in order to allow more worker concurrency.
>
> I measured these patches with my HTTP server (serving static files and running a tiny PHP script) and with a micro-benchmark that submits millions of `IORING_OP_NOP` entries (with `IOSQE_ASYNC` to force offloading the operation to a worker, so this offload overhead can be measured).
>
> Some of the optimizations eliminate memory accesses, e.g. by passing values that are already known to (inlined) functions and by caching values in local variables. These are useful optimizations, but they are too small to measure in a benchmark (too much noise).
>
> Some of the patches have a measurable effect, and they contain benchmark numbers that I could reproduce in repeated runs, despite the noise.
>
> I'm not confident about the correctness of the last patch ("io_uring: skip redundant poll wakeups"). This seemed like low-hanging fruit, so low that it seemed suspicious to me. If this is a useful optimization, the idea could probably be ported to other wait_queue users, or even into the wait_queue library. What I'm not confident about is whether the optimization is valid or whether it may miss wakeups, leading to stalls. Please advise!

That last patch is the only one that needs a bit more checking, so I'd suggest we just ignore it for now. We're in the middle of the merge window, so all of this would have to wait for the 6.15 merge window anyway - iow, plenty of time.

The other patches look pretty straightforward to me. The only thing that has me puzzled a bit is why you have so much io-wq activity with your application; in general I'd expect zero activity there. But then I saw the forced ASYNC flag, and it makes sense. In general, forcing that isn't a great idea, but for a benchmark of io-wq it certainly makes sense.

I'll apply 1-7 once 6.14-rc1 is out and I can kick off a for-6.15/io_uring branch. Thanks!
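For reference, a minimal sketch of the kind of micro-benchmark described in the cover letter, written against the public liburing API; the actual benchmark is not part of this thread, so the queue depth and loop counts here are illustrative only:

#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	int i;

	if (io_uring_queue_init(256, &ring, 0) < 0)
		return 1;

	for (i = 0; i < 256; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_nop(sqe);
		/* IOSQE_ASYNC punts even this no-op to an io-wq worker,
		 * so the offload overhead itself is what gets measured */
		io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
	}

	io_uring_submit(&ring);

	for (i = 0; i < 256; i++) {
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}

In a real run this submit/reap loop would be repeated until millions of NOPs have been processed, as described above.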
On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <axboe@kernel.dk> wrote:
> The other patches look pretty straightforward to me. The only thing that has me puzzled a bit is why you have so much io-wq activity with your application; in general I'd expect zero activity there. But then I saw the forced ASYNC flag, and it makes sense. In general, forcing that isn't a great idea, but for a benchmark of io-wq it certainly makes sense.

I was experimenting with io_uring and wanted to see how much performance I can squeeze out of my web server running single-threaded. The overhead of io_uring_submit() grew very large, because the "send" operation would do a lot of synchronous work in the kernel. I tried SQPOLL, but it was actually a big performance regression; it just shifted my CPU usage to epoll_wait(). Forcing ASYNC gave me large throughput improvements (moving the submission overhead to io-wq), but then the io-wq lock contention was the next limit, hence this patch series.

I'm still experimenting, and I will certainly revisit SQPOLL to learn more about why it didn't help and how to fix it.

> I'll apply 1-7 once 6.14-rc1 is out and I can kick off a for-6.15/io_uring branch. Thanks!

Thanks Jens, and please let me know when you're ready to discuss the last patch. It's a big improvement for those who combine io_uring with epoll; it's worth it.

Max
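For context, SQPOLL is enabled at ring setup time. A minimal sketch of what the SQPOLL experiment presumably looked like (the actual server configuration is not shown in this thread, and the sizes are assumptions):

#include <liburing.h>
#include <string.h>

static int setup_sqpoll_ring(struct io_uring *ring)
{
	struct io_uring_params p;

	memset(&p, 0, sizeof(p));
	/* a kernel thread polls the SQ ring, so io_uring_submit() can
	 * often skip the io_uring_enter() syscall entirely */
	p.flags = IORING_SETUP_SQPOLL;
	p.sq_thread_idle = 2000;	/* ms before the SQ thread goes to sleep */

	return io_uring_queue_init_params(256, ring, &p);
}

Whether this pays off depends heavily on keeping the SQ thread saturated, which is exactly the caveat Pavel raises later in the thread.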
On 1/29/25 10:39 AM, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <axboe@kernel.dk> wrote:
>> The other patches look pretty straightforward to me. The only thing that has me puzzled a bit is why you have so much io-wq activity with your application; in general I'd expect zero activity there. But then I saw the forced ASYNC flag, and it makes sense. In general, forcing that isn't a great idea, but for a benchmark of io-wq it certainly makes sense.
>
> I was experimenting with io_uring and wanted to see how much performance I can squeeze out of my web server running single-threaded. The overhead of io_uring_submit() grew very large, because the "send" operation would do a lot of synchronous work in the kernel. I tried SQPOLL, but it was actually a big performance regression; it just shifted my CPU usage to epoll_wait(). Forcing ASYNC gave me large throughput improvements (moving the submission overhead to io-wq), but then the io-wq lock contention was the next limit, hence this patch series.
>
> I'm still experimenting, and I will certainly revisit SQPOLL to learn more about why it didn't help and how to fix it.

Why are you combining it with epoll in the first place? It's a lot more efficient to wait on one or multiple events in io_uring_enter() rather than go back to a serialized one-event-per-notification model by using epoll to wait on completions on the io_uring side.
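A sketch of the pure-io_uring waiting model Jens describes: block for completions inside io_uring_enter() (here via liburing helpers) and reap whatever is ready in one batch, instead of waking up once per event through epoll. The helper name and batch size are hypothetical:

#include <liburing.h>

static void reap_completions(struct io_uring *ring)
{
	struct io_uring_cqe *cqes[64];
	unsigned i, count;

	/* submit anything pending and sleep until at least one CQE arrives */
	io_uring_submit_and_wait(ring, 1);

	/* drain everything that is ready in one go */
	count = io_uring_peek_batch_cqe(ring, cqes, 64);
	for (i = 0; i < count; i++) {
		/* handle_completion(cqes[i]); -- application-specific */
	}
	io_uring_cq_advance(ring, count);
}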
On Wed, Jan 29, 2025 at 6:45 PM Jens Axboe <axboe@kernel.dk> wrote:
> Why are you combining it with epoll in the first place? It's a lot more efficient to wait on one or multiple events in io_uring_enter() rather than go back to a serialized one-event-per-notification model by using epoll to wait on completions on the io_uring side.

Yes, I wish I could do that, but that works only if everything is io_uring - all or nothing. Most of the code is built around an epoll-based loop and will not be ported to io_uring so quickly.

Maybe what's missing is epoll_wait as an io_uring opcode. Then I could wrap it the other way around. Or am I supposed to use io_uring poll_add_multishot for that?

Max
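One way to read the poll_add_multishot suggestion (an assumption on my part, not something spelled out in the thread): arm a multishot poll on the existing epoll fd, so the ring posts a CQE whenever the epoll set becomes ready, and the event loop then calls epoll_wait() with a zero timeout to collect the events. A sketch, where 'epfd' is assumed to be an epoll instance created elsewhere:

#include <liburing.h>
#include <poll.h>
#include <stdint.h>

static void watch_epoll_fd(struct io_uring *ring, int epfd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	/* multishot: keeps posting a CQE each time epfd becomes readable,
	 * until the request is cancelled or terminated by an error */
	io_uring_prep_poll_multishot(sqe, epfd, POLLIN);
	io_uring_sqe_set_data(sqe, (void *)(uintptr_t)epfd);
	io_uring_submit(ring);
}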
On 1/29/25 17:39, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 6:19 PM Jens Axboe <axboe@kernel.dk> wrote:
>> The other patches look pretty straightforward to me. The only thing that has me puzzled a bit is why you have so much io-wq activity with your application; in general I'd expect zero activity there. But then I saw the forced ASYNC flag, and it makes sense. In general, forcing that isn't a great idea, but for a benchmark of io-wq it certainly makes sense.
>
> I was experimenting with io_uring and wanted to see how much performance I can squeeze out of my web server running single-threaded. The overhead of io_uring_submit() grew very large, because the "send" operation would do a lot of synchronous work in the kernel. I tried SQPOLL, but it was actually a big performance regression; it just shifted my CPU usage to epoll_wait(). Forcing ASYNC gave me large throughput improvements (moving the submission overhead to io-wq), but then the io-wq lock contention was the next limit, hence this patch series.

It's great to see io-wq getting some optimisations, but note that it wouldn't be fair to compare it to single-threaded peers when you have a lot of io-wq activity, as it might be occupying multiple CPUs. A curious open question is whether it'd be more performant to have several user threads with their private rings.

> I'm still experimenting, and I will certainly revisit SQPOLL to learn more about why it didn't help and how to fix it.

It's wasteful unless you saturate it close to 100%, and then you usually have SQPOLL on a separate CPU from the user task submitting requests, so it'd take some cache bouncing. It's not a silver bullet.
On Wed, Jan 29, 2025 at 8:30 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> It's great to see io-wq getting some optimisations, but note that it wouldn't be fair to compare it to single-threaded peers when you have a lot of io-wq activity, as it might be occupying multiple CPUs.

True. Fully loaded with the benchmark, I see 400%-600% CPU usage on my process (30-40% of that being spinlock contention). I wanted to explore how far I can get with a single (userspace) thread and leave the dirty thread-sync work to the kernel.

> It's wasteful unless you saturate it close to 100%, and then you usually have SQPOLL on a separate CPU from the user task submitting requests, so it'd take some cache bouncing. It's not a silver bullet.

Of course, memory latency always bites us in the end. But this isn't the endgame just yet; we still have a lot of potential for optimizations.
On 1/29/25 11:01 AM, Max Kellermann wrote:
> On Wed, Jan 29, 2025 at 6:45 PM Jens Axboe <axboe@kernel.dk> wrote:
>> Why are you combining it with epoll in the first place? It's a lot more efficient to wait on one or multiple events in io_uring_enter() rather than go back to a serialized one-event-per-notification model by using epoll to wait on completions on the io_uring side.
>
> Yes, I wish I could do that, but that works only if everything is io_uring - all or nothing. Most of the code is built around an epoll-based loop and will not be ported to io_uring so quickly.
>
> Maybe what's missing is epoll_wait as an io_uring opcode. Then I could wrap it the other way around. Or am I supposed to use io_uring poll_add_multishot for that?

Not a huge fan of adding more epoll logic to io_uring, but you are right that this case may indeed make sense, as it allows you to integrate better with existing event loops. I'll take a look.
On 1/31/25 9:13 AM, Jens Axboe wrote:
> On 1/29/25 11:01 AM, Max Kellermann wrote:
>> On Wed, Jan 29, 2025 at 6:45 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> Why are you combining it with epoll in the first place? It's a lot more efficient to wait on one or multiple events in io_uring_enter() rather than go back to a serialized one-event-per-notification model by using epoll to wait on completions on the io_uring side.
>>
>> Yes, I wish I could do that, but that works only if everything is io_uring - all or nothing. Most of the code is built around an epoll-based loop and will not be ported to io_uring so quickly.
>>
>> Maybe what's missing is epoll_wait as an io_uring opcode. Then I could wrap it the other way around. Or am I supposed to use io_uring poll_add_multishot for that?
>
> Not a huge fan of adding more epoll logic to io_uring, but you are right that this case may indeed make sense, as it allows you to integrate better with existing event loops. I'll take a look.

Here's a series doing that:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-epoll-wait

Could actually work pretty well - the last patch adds multishot support as well, which means we can avoid the write lock dance for repeated triggers of this epoll event. That should actually end up being more efficient than regular epoll_wait(2).

Wrote a basic test case to exercise it, and it seems to work fine for me, but obviously it's not super well tested just yet. Below is the liburing diff, which just adds the helper to prepare one of these epoll wait requests.

diff --git a/src/include/liburing.h b/src/include/liburing.h
index 49b4edf437b2..a95c475496f4 100644
--- a/src/include/liburing.h
+++ b/src/include/liburing.h
@@ -729,6 +729,15 @@ IOURINGINLINE void io_uring_prep_listen(struct io_uring_sqe *sqe, int fd,
 	io_uring_prep_rw(IORING_OP_LISTEN, sqe, fd, 0, backlog, 0);
 }
 
+struct epoll_event;
+IOURINGINLINE void io_uring_prep_epoll_wait(struct io_uring_sqe *sqe, int fd,
+					    struct epoll_event *events,
+					    int maxevents, unsigned flags)
+{
+	io_uring_prep_rw(IORING_OP_EPOLL_WAIT, sqe, fd, events, maxevents, 0);
+	sqe->epoll_flags = flags;
+}
+
 IOURINGINLINE void io_uring_prep_files_update(struct io_uring_sqe *sqe,
 					      int *fds, unsigned nr_fds,
 					      int offset)
diff --git a/src/include/liburing/io_uring.h b/src/include/liburing/io_uring.h
index 765919883cff..bc725787ceb7 100644
--- a/src/include/liburing/io_uring.h
+++ b/src/include/liburing/io_uring.h
@@ -73,6 +73,7 @@ struct io_uring_sqe {
 			__u32	futex_flags;
 			__u32	install_fd_flags;
 			__u32	nop_flags;
+			__u32	epoll_flags;
 		};
 	__u64	user_data;	/* data to be passed back at completion time */
 	/* pack this to avoid bogus arm OABI complaints */
@@ -262,6 +263,7 @@ enum io_uring_op {
 	IORING_OP_FTRUNCATE,
 	IORING_OP_BIND,
 	IORING_OP_LISTEN,
+	IORING_OP_EPOLL_WAIT,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
@@ -388,6 +390,11 @@ enum io_uring_op {
 #define IORING_ACCEPT_DONTWAIT		(1U << 1)
 #define IORING_ACCEPT_POLL_FIRST	(1U << 2)
 
+/*
+ * epoll_wait flags, stored in sqe->epoll_flags
+ */
+#define IORING_EPOLL_WAIT_MULTISHOT	(1U << 0)
+
 /*
  * IORING_OP_MSG_RING command types, stored in sqe->addr
  */
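For illustration, usage of the new helper from the diff above would look roughly like this. This is a sketch only: the kernel branch is experimental, and the assumption that the completion reports the ready events copied into 'events' mirrors epoll_wait(2) semantics rather than anything stated in this thread:

#include <liburing.h>
#include <sys/epoll.h>

static void queue_epoll_wait(struct io_uring *ring, int epfd,
			     struct epoll_event *events, int maxevents)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	/* multishot: one SQE keeps producing a CQE (with ready events
	 * written into 'events') every time the epoll set fires */
	io_uring_prep_epoll_wait(sqe, epfd, events, maxevents,
				 IORING_EPOLL_WAIT_MULTISHOT);
	io_uring_submit(ring);
}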
On Sat, Feb 1, 2025 at 4:26 PM Jens Axboe <axboe@kernel.dk> wrote:
>> Not a huge fan of adding more epoll logic to io_uring, but you are right that this case may indeed make sense, as it allows you to integrate better with existing event loops. I'll take a look.
>
> Here's a series doing that:
>
> https://git.kernel.dk/cgit/linux/log/?h=io_uring-epoll-wait
>
> Could actually work pretty well - the last patch adds multishot support as well, which means we can avoid the write lock dance for repeated triggers of this epoll event. That should actually end up being more efficient than regular epoll_wait(2).

Nice, thanks Jens! I will integrate this in our I/O event loop and test it next week. This will eliminate the io_uring poll wakeup overhead completely.
On 2/1/25 8:30 AM, Max Kellermann wrote:
> On Sat, Feb 1, 2025 at 4:26 PM Jens Axboe <axboe@kernel.dk> wrote:
>>> Not a huge fan of adding more epoll logic to io_uring, but you are right that this case may indeed make sense, as it allows you to integrate better with existing event loops. I'll take a look.
>>
>> Here's a series doing that:
>>
>> https://git.kernel.dk/cgit/linux/log/?h=io_uring-epoll-wait
>>
>> Could actually work pretty well - the last patch adds multishot support as well, which means we can avoid the write lock dance for repeated triggers of this epoll event. That should actually end up being more efficient than regular epoll_wait(2).
>
> Nice, thanks Jens! I will integrate this in our I/O event loop and test it next week. This will eliminate the io_uring poll wakeup overhead completely.

That'd be great, let us know how it goes!