Message ID | 20240524230501.20178-1-axboe@kernel.dk
---|---
Series | Improve MSG_RING SINGLE_ISSUER performance
On 5/24/24 23:58, Jens Axboe wrote:
> Hi,
>
> A ring setup with with IORING_SETUP_SINGLE_ISSUER, which is required to

IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's
specifically an IORING_SETUP_DEFER_TASKRUN optimisation.

> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
> generic task_work. This isn't ideal. This patchset attempts to rectify
> that, taking a new approach rather than trying to use the io_uring
> task_work infrastructure to handle it as in previous postings.

Not sure why you'd want to piggyback onto overflows, it's not
such a well made and reliable infra, whereas the DEFER_TASKRUN
part of the task_work approach was fine.

The completion path doesn't usually look at the overflow list
but on cached cqe pointers showing the CQ is full, that means
after you queue an overflow someone may post a CQE in the CQ
in the normal path and you get reordering. Not that bad
considering it's from another ring, but a bit nasty and surely
will bite us back in the future, it always does.

That's assuming you decide io_msg_need_remote() solely based
->task_complete and remove

    return current != target_ctx->submitter_task;

otherwise you can get two linked msg_ring target CQEs reordered.

It's also duplicating that crappy overflow code nobody cares
much about, and it's still a forced wake up with no batching,
circumventing the normal wake up path, i.e. io_uring tw.

I don't think we should care about the request completion
latency (sender latency), people should be more interested
in the reaction speed (receiver latency), but if you care
about it for a reason, perhaps you can just as well allocate
a new request and complete the previous one right away.

> In a sample test app that has one thread send messages to another and
> logging both the time between sender sending and receiver receving and
> just the time for the sender to post a message and get the CQE back,
> I see the following sender latencies with the stock kernel:
>
> Latencies for: Sender
>     percentiles (nsec):
>      | 1.0000th=[ 4384], 5.0000th=[ 4512], 10.0000th=[ 4576],
>      | 20.0000th=[ 4768], 30.0000th=[ 4896], 40.0000th=[ 5024],
>      | 50.0000th=[ 5088], 60.0000th=[ 5152], 70.0000th=[ 5280],
>      | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
>      | 99.0000th=[ 8032], 99.5000th=[18048], 99.9000th=[21376],
>      | 99.9500th=[26496], 99.9900th=[59136]
>
> and with the patches:
>
> Latencies for: Sender
>     percentiles (nsec):
>      | 1.0000th=[ 756], 5.0000th=[ 820], 10.0000th=[ 828],
>      | 20.0000th=[ 844], 30.0000th=[ 852], 40.0000th=[ 852],
>      | 50.0000th=[ 860], 60.0000th=[ 860], 70.0000th=[ 868],
>      | 80.0000th=[ 884], 90.0000th=[ 964], 95.0000th=[ 988],
>      | 99.0000th=[ 1128], 99.5000th=[ 1208], 99.9000th=[ 1544],
>      | 99.9500th=[ 1944], 99.9900th=[ 2896]
>
> For the receiving side the win is smaller as it only "suffers" from
> a single generic task_work, about a 10% win in latencies there.
>
> The idea here is to utilize the CQE overflow infrastructure for this,
> as that allows the right task to post the CQE to the ring.
>
> 1 is a basic refactoring prep patch, patch 2 adds support for normal
> messages, and patch 3 adopts the same approach for fd passing.
>
>  io_uring/msg_ring.c | 151 ++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 138 insertions(+), 13 deletions(-)
On 5/28/24 7:31 AM, Pavel Begunkov wrote:
> On 5/24/24 23:58, Jens Axboe wrote:
>> Hi,
>>
>> A ring setup with with IORING_SETUP_SINGLE_ISSUER, which is required to
>
> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's
> specifically an IORING_SETUP_DEFER_TASKRUN optimisation.

Right, I should change that in the commit message. It's task_complete
driving it, which is tied to DEFER_TASKRUN.

>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
>> generic task_work. This isn't ideal. This patchset attempts to rectify
>> that, taking a new approach rather than trying to use the io_uring
>> task_work infrastructure to handle it as in previous postings.
>
> Not sure why you'd want to piggyback onto overflows, it's not
> such a well made and reliable infra, whereas the DEFER_TASKRUN
> part of the task_work approach was fine.

It's not right now, because it's really a "don't get into this
condition, if you do, things are slower". And this series doesn't really
change that, and honestly it doesn't even need to. It's still way better
than what we had before in terms of DEFER_TASKRUN and messages.

We could certainly make the messages a subset of real overflow if we
wanted, but honestly it looks decent enough as-is with the changes that
I'm not hugely motivated to do that.

> The completion path doesn't usually look at the overflow list
> but on cached cqe pointers showing the CQ is full, that means
> after you queue an overflow someone may post a CQE in the CQ
> in the normal path and you get reordering. Not that bad
> considering it's from another ring, but a bit nasty and surely
> will bite us back in the future, it always does.

This is true, but that's generally true as well, as completions come in
async. You don't get to control the exact order on a remote ring.
Messages themselves will be ordered when posted, which should be the
important aspect here. Order with locally posted completions I don't see
why you'd care about, that's a timing game that you cannot control.

> That's assuming you decide io_msg_need_remote() solely based
> ->task_complete and remove
>
>     return current != target_ctx->submitter_task;
>
> otherwise you can get two linked msg_ring target CQEs reordered.

Good point, since it'd now be cheap enough, it would probably make sense
to simply gate it on task_complete alone. I even toyed with the idea of
just using this approach for any ring type and killing some code in
there, but didn't venture that far yet.

> It's also duplicating that crappy overflow code nobody cares
> much about, and it's still a forced wake up with no batching,
> circumventing the normal wake up path, i.e. io_uring tw.

Yes, since this is v1 I didn't bother to integrate more tightly with the
generic overflows, that should obviously be done by first adding a
helper for this. I consider that pretty minor.

> I don't think we should care about the request completion
> latency (sender latency), people should be more interested
> in the reaction speed (receiver latency), but if you care
> about it for a reason, perhaps you can just as well allocate
> a new request and complete the previous one right away.

I know the numbers I posted were just sender side improvements on that
particular box, however that isn't really the case for others.
Here's on another test box:

axboe@r7525 ~> ./msg-lat
init_flags=3000
Wait on startup
802775: my fd 3, other 4
802776: my fd 4, other 3
Latencies for: Receiver
    percentiles (nsec):
     | 1.0000th=[ 4192], 5.0000th=[ 4320], 10.0000th=[ 4448],
     | 20.0000th=[ 4576], 30.0000th=[ 4704], 40.0000th=[ 4832],
     | 50.0000th=[ 4960], 60.0000th=[ 5088], 70.0000th=[ 5216],
     | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
     | 99.0000th=[ 6176], 99.5000th=[ 7136], 99.9000th=[19584],
     | 99.9500th=[20352], 99.9900th=[28288]
Latencies for: Sender
    percentiles (nsec):
     | 1.0000th=[ 6560], 5.0000th=[ 6880], 10.0000th=[ 7008],
     | 20.0000th=[ 7264], 30.0000th=[ 7456], 40.0000th=[ 7712],
     | 50.0000th=[ 8032], 60.0000th=[ 8256], 70.0000th=[ 8512],
     | 80.0000th=[ 8640], 90.0000th=[ 8896], 95.0000th=[ 9152],
     | 99.0000th=[ 9792], 99.5000th=[11584], 99.9000th=[23168],
     | 99.9500th=[28032], 99.9900th=[40192]

and after:

axboe@r7525 ~> ./msg-lat 1.776s
init_flags=3000
Wait on startup
3767: my fd 3, other 4
3768: my fd 4, other 3
Latencies for: Sender
    percentiles (nsec):
     | 1.0000th=[ 740], 5.0000th=[ 748], 10.0000th=[ 756],
     | 20.0000th=[ 764], 30.0000th=[ 764], 40.0000th=[ 772],
     | 50.0000th=[ 772], 60.0000th=[ 780], 70.0000th=[ 780],
     | 80.0000th=[ 860], 90.0000th=[ 892], 95.0000th=[ 900],
     | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656],
     | 99.9500th=[ 1976], 99.9900th=[ 3408]
Latencies for: Receiver
    percentiles (nsec):
     | 1.0000th=[ 2736], 5.0000th=[ 2736], 10.0000th=[ 2768],
     | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800],
     | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896],
     | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120],
     | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560],
     | 99.9500th=[21632], 99.9900th=[58624]

Obviously some variation in runs in general, but it's most certainly
faster in terms of receiving too. This test case is fixed at doing 100K
messages per second, didn't do any peak testing. But I strongly suspect
you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL
task_work is pretty sucky imho.

You could just make it io_kiocb based, but I did not want to get into
foreign requests on remote rings. What would you envision with that
approach, using our normal ring task_work for this instead? That would
be an approach, obviously this one took a different path from the
previous task_work driven approach.
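For context, the init_flags=3000 printed by the test above reads as hex 0x3000, which corresponds to IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN. A minimal liburing sketch of setting up a ring that way (the entry count and error handling are arbitrary, not taken from the actual msg-lat tool):

#include <liburing.h>
#include <stdio.h>

int main(void)
{
    struct io_uring ring;
    /* (1U << 12) | (1U << 13) == 0x3000, matching init_flags above */
    unsigned int flags = IORING_SETUP_SINGLE_ISSUER |
                         IORING_SETUP_DEFER_TASKRUN;
    int ret;

    ret = io_uring_queue_init(64, &ring, flags);
    if (ret < 0) {
        fprintf(stderr, "queue_init: %d\n", ret);
        return 1;
    }
    printf("init_flags=%x\n", flags);
    io_uring_queue_exit(&ring);
    return 0;
}

With DEFER_TASKRUN, completion task_work only runs when the ring's submitter task enters the kernel to wait for events, which is why a msg_ring CQE aimed at such a ring otherwise needs the extra task_work round trips discussed in this thread.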
On 5/28/24 8:34 AM, Jens Axboe wrote:
> You could just make it io_kiocb based, but I did not want to get into
> foreign requests on remote rings. What would you envision with that
> approach, using our normal ring task_work for this instead? That would
> be an approach, obviously this one took a different path from the
> previous task_work driven approach.

Foreign io_kiocb itself isn't even enough, you'd need an overflow cqe
allocated upfront too for posting directly without waiting for the
reply. Basically it needs to be confident it'll post successfully on the
remote ring, otherwise we cannot complete the source msg ring sqe
inline.
On 5/28/24 8:34 AM, Jens Axboe wrote:
> Obivously some variation in runs in general, but it's most certainly
> faster in terms of receiving too. This test case is fixed at doing 100K
> messages per second, didn't do any peak testing. But I strongly suspect
> you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL
> task_work is pretty sucky imho.

Example on that same box: running the same message passer test, sys time
used for a 3 second run varies between 300ms and 850ms on the stock
kernel (with a lot of variance); after the patches it's around ~150ms
and pretty steady.
On 5/28/24 15:34, Jens Axboe wrote: > On 5/28/24 7:31 AM, Pavel Begunkov wrote: >> On 5/24/24 23:58, Jens Axboe wrote: >>> Hi, >>> >>> A ring setup with with IORING_SETUP_SINGLE_ISSUER, which is required to >> >> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's >> specifically an IORING_SETUP_DEFER_TASKRUN optimisation. > > Right, I should change that in the commit message. It's task_complete > driving it, which is tied to DEFER_TASKRUN. > >>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through >>> generic task_work. This isn't ideal. This patchset attempts to rectify >>> that, taking a new approach rather than trying to use the io_uring >>> task_work infrastructure to handle it as in previous postings. >> >> Not sure why you'd want to piggyback onto overflows, it's not >> such a well made and reliable infra, whereas the DEFER_TASKRUN >> part of the task_work approach was fine. > > It's not right now, because it's really a "don't get into this > condition, if you do, things are slower". And this series doesn't really > change that, and honestly it doesn't even need to. It's still way better > than what we had before in terms of DEFER_TASKRUN and messages. Better than how it is now or comparing to the previous attempt? I think the one using io_uring's tw infra was better, which is where all wake ups and optimisations currently consolidate. > We could certainly make the messages a subset of real overflow if we > wanted, but honestly it looks decent enough as-is with the changes that > I'm not hugely motivated to do that. Not sure what you mean here, but not really suggesting refactoring overflows. Taking the tw based patches should be easy, it only needs killing !DEFER_TASKRUN changes from there. >> The completion path doesn't usually look at the overflow list >> but on cached cqe pointers showing the CQ is full, that means >> after you queue an overflow someone may post a CQE in the CQ >> in the normal path and you get reordering. Not that bad >> considering it's from another ring, but a bit nasty and surely >> will bite us back in the future, it always does. > > This is true, but generally true as well as completions come in async. > You don't get to control the exact order on a remote ring. Messages > themselves will be ordered when posted, which should be the important > aspect here. Order with locally posted completions I don't see why you'd > care, that's a timing game that you cannot control. True for a single request, but in a more complex system sender's ordering will affect the order on the receiving side. ring1: msg_ring(); write(pipe) ring2: read(pipe) The user would definitely think that the other ring will first get a msg_ring CQE and then an CQE from the read, but as always it's hard to predict all repercussions. >> That's assuming you decide io_msg_need_remote() solely based >> ->task_complete and remove >> >> return current != target_ctx->submitter_task; >> >> otherwise you can get two linked msg_ring target CQEs reordered. > > Good point, since it'd now be cheap enough, would probably make sense to > simply gate it on task_complete alone. I even toyed with the idea of > just using this approach for any ring type and kill some code in there, > but didn't venture that far yet. That task check doesn't make any real difference. If it's the same thread you can skip io_uring all together. 
>> It's also duplicating that crappy overflow code nobody cares >> much about, and it's still a forced wake up with no batching, >> circumventing the normal wake up path, i.e. io_uring tw. > > Yes, since this is v1 I didn't bother to integrate more tightly with the > generic overflows, that should obviously be done by first adding a > helper for this. I consider that pretty minor. My problem is not duplication of code base but rather extending the internal user base of it. You can say that msg_ring can easily become a hot path for some, and then we'll be putting efforts both into overflows and task_work when in essence they solve quite a similar problem here. >> I don't think we should care about the request completion >> latency (sender latency), people should be more interested >> in the reaction speed (receiver latency), but if you care >> about it for a reason, perhaps you can just as well allocate >> a new request and complete the previous one right away. > > I know the numbers I posted was just sender side improvements on that > particular box, however that isn't really the case for others. Here's on > an another test box: > > axboe@r7525 ~> ./msg-lat > init_flags=3000 > Wait on startup > 802775: my fd 3, other 4 > 802776: my fd 4, other 3 > Latencies for: Receiver > percentiles (nsec): > | 1.0000th=[ 4192], 5.0000th=[ 4320], 10.0000th=[ 4448], > | 20.0000th=[ 4576], 30.0000th=[ 4704], 40.0000th=[ 4832], > | 50.0000th=[ 4960], 60.0000th=[ 5088], 70.0000th=[ 5216], > | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728], > | 99.0000th=[ 6176], 99.5000th=[ 7136], 99.9000th=[19584], > | 99.9500th=[20352], 99.9900th=[28288] > Latencies for: Sender > percentiles (nsec): > | 1.0000th=[ 6560], 5.0000th=[ 6880], 10.0000th=[ 7008], > | 20.0000th=[ 7264], 30.0000th=[ 7456], 40.0000th=[ 7712], > | 50.0000th=[ 8032], 60.0000th=[ 8256], 70.0000th=[ 8512], > | 80.0000th=[ 8640], 90.0000th=[ 8896], 95.0000th=[ 9152], > | 99.0000th=[ 9792], 99.5000th=[11584], 99.9000th=[23168], > | 99.9500th=[28032], 99.9900th=[40192] > > and after: > > axboe@r7525 ~> ./msg-lat 1.776s > init_flags=3000 > Wait on startup > 3767: my fd 3, other 4 > 3768: my fd 4, other 3 > Latencies for: Sender > percentiles (nsec): > | 1.0000th=[ 740], 5.0000th=[ 748], 10.0000th=[ 756], > | 20.0000th=[ 764], 30.0000th=[ 764], 40.0000th=[ 772], > | 50.0000th=[ 772], 60.0000th=[ 780], 70.0000th=[ 780], > | 80.0000th=[ 860], 90.0000th=[ 892], 95.0000th=[ 900], > | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656], > | 99.9500th=[ 1976], 99.9900th=[ 3408] > Latencies for: Receiver > percentiles (nsec): > | 1.0000th=[ 2736], 5.0000th=[ 2736], 10.0000th=[ 2768], > | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800], > | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896], > | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120], > | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560], > | 99.9500th=[21632], 99.9900th=[58624] > > Obivously some variation in runs in general, but it's most certainly > faster in terms of receiving too. This test case is fixed at doing 100K > messages per second, didn't do any peak testing. But I strongly suspect > you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL > task_work is pretty sucky imho. Sure, you mentioned wins on the receiver side, I consider it to be the main merit (latency and throughput) > You could just make it io_kiocb based, but I did not want to get into > foreign requests on remote rings. 
> What would you envision with that
> approach, using our normal ring task_work for this instead? That would

It was buggy in the !DEFER_TASKRUN path. Fortunately, you don't care
about it because it just does it all under ->completion_lock, which
is why you shouldn't have ever hit the problem in testing.

> be an approach, obviously this one took a different path from the
> previous task_work driven approach.
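To make the ordering concern discussed above concrete (ring1: msg_ring(); write(pipe) vs ring2: read(pipe)), here is a rough userspace sketch of that scenario; the helper layout is an assumption, only the liburing calls themselves are standard:

#include <liburing.h>
#include <unistd.h>

/* ring1's task: post a CQE to ring2, then write to a pipe ring2 is reading */
static void ring1_side(struct io_uring *ring1, int ring2_fd, int pipe_wr)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring1);
    struct io_uring_cqe *cqe;
    char c = 0;

    io_uring_prep_msg_ring(sqe, ring2_fd, 0, 0xcafe, 0);
    io_uring_submit_and_wait(ring1, 1);   /* wait for our own msg_ring CQE */
    io_uring_wait_cqe(ring1, &cqe);
    io_uring_cqe_seen(ring1, cqe);

    write(pipe_wr, &c, 1);                /* then signal via the pipe */
}

/* ring2's task: a read on the pipe is already queued; which CQE lands first? */
static void ring2_side(struct io_uring *ring2, int pipe_rd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring2);
    struct io_uring_cqe *cqe;
    char c;

    io_uring_prep_read(sqe, pipe_rd, &c, 1, 0);
    io_uring_submit(ring2);

    /* per the concern above: with the overflow-based approach the msg_ring
     * CQE may still sit on the overflow list while the pipe read completes
     * into the CQ directly, so the "obvious" msg-before-read order is not
     * necessarily guaranteed */
    io_uring_wait_cqe(ring2, &cqe);
}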
On 5/28/24 10:50 AM, Pavel Begunkov wrote: > On 5/28/24 15:34, Jens Axboe wrote: >> On 5/28/24 7:31 AM, Pavel Begunkov wrote: >>> On 5/24/24 23:58, Jens Axboe wrote: >>>> Hi, >>>> >>>> A ring setup with with IORING_SETUP_SINGLE_ISSUER, which is required to >>> >>> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's >>> specifically an IORING_SETUP_DEFER_TASKRUN optimisation. >> >> Right, I should change that in the commit message. It's task_complete >> driving it, which is tied to DEFER_TASKRUN. >> >>>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through >>>> generic task_work. This isn't ideal. This patchset attempts to rectify >>>> that, taking a new approach rather than trying to use the io_uring >>>> task_work infrastructure to handle it as in previous postings. >>> >>> Not sure why you'd want to piggyback onto overflows, it's not >>> such a well made and reliable infra, whereas the DEFER_TASKRUN >>> part of the task_work approach was fine. >> >> It's not right now, because it's really a "don't get into this >> condition, if you do, things are slower". And this series doesn't really >> change that, and honestly it doesn't even need to. It's still way better >> than what we had before in terms of DEFER_TASKRUN and messages. > > Better than how it is now or comparing to the previous attempt? > I think the one using io_uring's tw infra was better, which is > where all wake ups and optimisations currently consolidate. Better than both - I haven't tested with the previous version, but I can certainly do that. The reason why I think it'll be better is that it avoids the double roundtrips. Yes v1 was using normal task_work which is better, but it didn't solve what I think is the fundamental issue here. I'll forward port it and give it a spin, then we'll know. >> We could certainly make the messages a subset of real overflow if we >> wanted, but honestly it looks decent enough as-is with the changes that >> I'm not hugely motivated to do that. > > Not sure what you mean here, but not really suggesting refactoring > overflows. Taking the tw based patches should be easy, it only > needs killing !DEFER_TASKRUN changes from there. Sorry wasn't clear, the refactoring was purely my suggestion to reduce a bit of the code duplication. >>> The completion path doesn't usually look at the overflow list >>> but on cached cqe pointers showing the CQ is full, that means >>> after you queue an overflow someone may post a CQE in the CQ >>> in the normal path and you get reordering. Not that bad >>> considering it's from another ring, but a bit nasty and surely >>> will bite us back in the future, it always does. >> >> This is true, but generally true as well as completions come in async. >> You don't get to control the exact order on a remote ring. Messages >> themselves will be ordered when posted, which should be the important >> aspect here. Order with locally posted completions I don't see why you'd >> care, that's a timing game that you cannot control. > > True for a single request, but in a more complex system > sender's ordering will affect the order on the receiving side. > > ring1: msg_ring(); write(pipe) > ring2: read(pipe) > > The user would definitely think that the other ring will > first get a msg_ring CQE and then an CQE from the read, but as > always it's hard to predict all repercussions. Nobody should be making assumptions like that, and that will in fact already not be the case. 
If the msg_ring fails to lock the remote ring, then it may very well end up in the hands of io-wq. And then you could get either result validly, msg CQE first or last. >>> That's assuming you decide io_msg_need_remote() solely based >>> ->task_complete and remove >>> >>> return current != target_ctx->submitter_task; >>> >>> otherwise you can get two linked msg_ring target CQEs reordered. >> >> Good point, since it'd now be cheap enough, would probably make sense to >> simply gate it on task_complete alone. I even toyed with the idea of >> just using this approach for any ring type and kill some code in there, >> but didn't venture that far yet. > > That task check doesn't make any real difference. If it's the > same thread you can skip io_uring all together. Yeah agree, I did add a patch in between after the last email to just remove the task check. It's not really a useful case to attempt to cater to in particular. >>> It's also duplicating that crappy overflow code nobody cares >>> much about, and it's still a forced wake up with no batching, >>> circumventing the normal wake up path, i.e. io_uring tw. >> >> Yes, since this is v1 I didn't bother to integrate more tightly with the >> generic overflows, that should obviously be done by first adding a >> helper for this. I consider that pretty minor. > > My problem is not duplication of code base but rather > extending the internal user base of it. You can say that > msg_ring can easily become a hot path for some, and > then we'll be putting efforts both into overflows and > task_work when in essence they solve quite a similar > problem here. That's why I was tempted to remove the task_work path for messages on top of this. But I agree on the wakeup side, that's obviously something that doesn't currently work with msg ring. And while I don't see that as a hard problem to solve, it is a bit annoying to have multiple sources for that. Would naturally be better to retain just the task_work side. For one use case that I think is interesting with messages is work passing, eliminating a user side data structure (and lock) for that and side channel wakeups, I don't think the wakeup batching matters as you're generally not going to be batching receiving work. You're either already running work for processing, or sleeping waiting for one. >>> I don't think we should care about the request completion >>> latency (sender latency), people should be more interested >>> in the reaction speed (receiver latency), but if you care >>> about it for a reason, perhaps you can just as well allocate >>> a new request and complete the previous one right away. >> >> I know the numbers I posted was just sender side improvements on that >> particular box, however that isn't really the case for others. 
Here's on >> an another test box: >> >> axboe@r7525 ~> ./msg-lat >> init_flags=3000 >> Wait on startup >> 802775: my fd 3, other 4 >> 802776: my fd 4, other 3 >> Latencies for: Receiver >> percentiles (nsec): >> | 1.0000th=[ 4192], 5.0000th=[ 4320], 10.0000th=[ 4448], >> | 20.0000th=[ 4576], 30.0000th=[ 4704], 40.0000th=[ 4832], >> | 50.0000th=[ 4960], 60.0000th=[ 5088], 70.0000th=[ 5216], >> | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728], >> | 99.0000th=[ 6176], 99.5000th=[ 7136], 99.9000th=[19584], >> | 99.9500th=[20352], 99.9900th=[28288] >> Latencies for: Sender >> percentiles (nsec): >> | 1.0000th=[ 6560], 5.0000th=[ 6880], 10.0000th=[ 7008], >> | 20.0000th=[ 7264], 30.0000th=[ 7456], 40.0000th=[ 7712], >> | 50.0000th=[ 8032], 60.0000th=[ 8256], 70.0000th=[ 8512], >> | 80.0000th=[ 8640], 90.0000th=[ 8896], 95.0000th=[ 9152], >> | 99.0000th=[ 9792], 99.5000th=[11584], 99.9000th=[23168], >> | 99.9500th=[28032], 99.9900th=[40192] >> >> and after: >> >> axboe@r7525 ~> ./msg-lat 1.776s >> init_flags=3000 >> Wait on startup >> 3767: my fd 3, other 4 >> 3768: my fd 4, other 3 >> Latencies for: Sender >> percentiles (nsec): >> | 1.0000th=[ 740], 5.0000th=[ 748], 10.0000th=[ 756], >> | 20.0000th=[ 764], 30.0000th=[ 764], 40.0000th=[ 772], >> | 50.0000th=[ 772], 60.0000th=[ 780], 70.0000th=[ 780], >> | 80.0000th=[ 860], 90.0000th=[ 892], 95.0000th=[ 900], >> | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656], >> | 99.9500th=[ 1976], 99.9900th=[ 3408] >> Latencies for: Receiver >> percentiles (nsec): >> | 1.0000th=[ 2736], 5.0000th=[ 2736], 10.0000th=[ 2768], >> | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800], >> | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896], >> | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120], >> | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560], >> | 99.9500th=[21632], 99.9900th=[58624] >> >> Obivously some variation in runs in general, but it's most certainly >> faster in terms of receiving too. This test case is fixed at doing 100K >> messages per second, didn't do any peak testing. But I strongly suspect >> you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL >> task_work is pretty sucky imho. > > Sure, you mentioned wins on the receiver side, I consider it > to be the main merit (latency and throughput) Actually I think both are interesting - not because the sender side is latency sensitive for on receving the CQE for it, but because it goes hand in hand with a reduction in cycles spent sending that work. That's the win on the sender side, more so than the latency win. The latter is just gravy on top. >> You could just make it io_kiocb based, but I did not want to get into >> foreign requests on remote rings. What would you envision with that >> approach, using our normal ring task_work for this instead? That would > > It was buggy in the !DEFER_TASKRUN path. Fortunately, you don't care > about it because it just does it all under ->completion_lock, which > is why you shouldn't have ever hit the problem in testing. I'll test the old approach and we'll see where we are at.
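As a rough illustration of the work-passing use case mentioned above (the work_item layout and helper names are made up for the example, only the liburing calls are real): instead of a locked user-side queue plus a separate wakeup, the sender hands a pointer over as msg_ring user_data and the receiver picks it up from the CQE it is already waiting on.

#include <liburing.h>
#include <stdint.h>

struct work_item {
    void (*fn)(struct work_item *);
    /* ... payload ... */
};

/* producer thread: pass one work item to the worker's ring; the sender's
 * own msg_ring completion is assumed to be reaped elsewhere */
static int pass_work(struct io_uring *my_ring, int worker_ring_fd,
                     struct work_item *w)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(my_ring);

    if (!sqe)
        return -1;
    /* the pointer travels as user_data of the CQE posted to the worker */
    io_uring_prep_msg_ring(sqe, worker_ring_fd, 0,
                           (uint64_t)(uintptr_t)w, 0);
    return io_uring_submit(my_ring);
}

/* worker thread: no queue, no lock, no eventfd -- the CQE is the wakeup */
static void worker_loop(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    while (io_uring_wait_cqe(ring, &cqe) == 0) {
        struct work_item *w = (void *)(uintptr_t)cqe->user_data;

        io_uring_cqe_seen(ring, cqe);
        w->fn(w);
    }
}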
On 5/28/24 12:07 PM, Jens Axboe wrote: > On 5/28/24 10:50 AM, Pavel Begunkov wrote: >> On 5/28/24 15:34, Jens Axboe wrote: >>> On 5/28/24 7:31 AM, Pavel Begunkov wrote: >>>> On 5/24/24 23:58, Jens Axboe wrote: >>>>> Hi, >>>>> >>>>> A ring setup with with IORING_SETUP_SINGLE_ISSUER, which is required to >>>> >>>> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's >>>> specifically an IORING_SETUP_DEFER_TASKRUN optimisation. >>> >>> Right, I should change that in the commit message. It's task_complete >>> driving it, which is tied to DEFER_TASKRUN. >>> >>>>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through >>>>> generic task_work. This isn't ideal. This patchset attempts to rectify >>>>> that, taking a new approach rather than trying to use the io_uring >>>>> task_work infrastructure to handle it as in previous postings. >>>> >>>> Not sure why you'd want to piggyback onto overflows, it's not >>>> such a well made and reliable infra, whereas the DEFER_TASKRUN >>>> part of the task_work approach was fine. >>> >>> It's not right now, because it's really a "don't get into this >>> condition, if you do, things are slower". And this series doesn't really >>> change that, and honestly it doesn't even need to. It's still way better >>> than what we had before in terms of DEFER_TASKRUN and messages. >> >> Better than how it is now or comparing to the previous attempt? >> I think the one using io_uring's tw infra was better, which is >> where all wake ups and optimisations currently consolidate. > > Better than both - I haven't tested with the previous version, but I can > certainly do that. The reason why I think it'll be better is that it > avoids the double roundtrips. Yes v1 was using normal task_work which is > better, but it didn't solve what I think is the fundamental issue here. > > I'll forward port it and give it a spin, then we'll know. I suspect a bug in the previous patches, because this is what the forward port looks like. 
First, for reference, the current results: init_flags=3000 Wait on startup 3767: my fd 3, other 4 3768: my fd 4, other 3 Latencies for: Sender percentiles (nsec): | 1.0000th=[ 740], 5.0000th=[ 748], 10.0000th=[ 756], | 20.0000th=[ 764], 30.0000th=[ 764], 40.0000th=[ 772], | 50.0000th=[ 772], 60.0000th=[ 780], 70.0000th=[ 780], | 80.0000th=[ 860], 90.0000th=[ 892], 95.0000th=[ 900], | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656], | 99.9500th=[ 1976], 99.9900th=[ 3408] Latencies for: Receiver percentiles (nsec): | 1.0000th=[ 2736], 5.0000th=[ 2736], 10.0000th=[ 2768], | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800], | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896], | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120], | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560], | 99.9500th=[21632], 99.9900th=[58624] and here's with io_uring-msg_ring.1, which is just a straight forward forward port of the previous patches on the same base as v2: init_flags=3000 Wait on startup 4097: my fd 4, other 3 4096: my fd 3, other 4 Latencies for: Receiver percentiles (nsec): | 1.0000th=[ 5920], 5.0000th=[ 5920], 10.0000th=[ 5984], | 20.0000th=[ 5984], 30.0000th=[ 6048], 40.0000th=[ 6048], | 50.0000th=[ 6112], 60.0000th=[ 6304], 70.0000th=[ 6368], | 80.0000th=[ 6560], 90.0000th=[ 6880], 95.0000th=[ 7072], | 99.0000th=[ 7456], 99.5000th=[ 7712], 99.9000th=[ 8640], | 99.9500th=[10432], 99.9900th=[26240] Latencies for: Sender percentiles (nsec): | 1.0000th=[ 9536], 5.0000th=[ 9664], 10.0000th=[ 9664], | 20.0000th=[ 9920], 30.0000th=[ 9920], 40.0000th=[10048], | 50.0000th=[10176], 60.0000th=[10304], 70.0000th=[10432], | 80.0000th=[10688], 90.0000th=[10944], 95.0000th=[11328], | 99.0000th=[11840], 99.5000th=[12096], 99.9000th=[13888], | 99.9500th=[15424], 99.9900th=[34560]
On 5/28/24 12:31 PM, Jens Axboe wrote: > I suspect a bug in the previous patches, because this is what the > forward port looks like. First, for reference, the current results: Got it sorted, and pinned sender and receiver on CPUs to avoid the variation. It looks like this with the task_work approach that I sent out as v1: Latencies for: Sender percentiles (nsec): | 1.0000th=[ 2160], 5.0000th=[ 2672], 10.0000th=[ 2768], | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600], | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632], | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696], | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192], | 99.9500th=[16320], 99.9900th=[18304] Latencies for: Receiver percentiles (nsec): | 1.0000th=[ 1528], 5.0000th=[ 1576], 10.0000th=[ 1656], | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064], | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096], | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160], | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168], | 99.9500th=[15424], 99.9900th=[17280] and here's the exact same test run on the current patches: Latencies for: Sender percentiles (nsec): | 1.0000th=[ 362], 5.0000th=[ 362], 10.0000th=[ 370], | 20.0000th=[ 370], 30.0000th=[ 370], 40.0000th=[ 370], | 50.0000th=[ 374], 60.0000th=[ 382], 70.0000th=[ 382], | 80.0000th=[ 382], 90.0000th=[ 382], 95.0000th=[ 390], | 99.0000th=[ 402], 99.5000th=[ 430], 99.9000th=[ 900], | 99.9500th=[ 972], 99.9900th=[ 1432] Latencies for: Receiver percentiles (nsec): | 1.0000th=[ 1528], 5.0000th=[ 1544], 10.0000th=[ 1560], | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592], | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608], | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688], | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272], | 99.9500th=[14784], 99.9900th=[73216] I'll try and augment the test app to do proper rated submissions, so I can ramp up the rates a bit and see what happens.
On 5/28/24 5:04 PM, Jens Axboe wrote: > On 5/28/24 12:31 PM, Jens Axboe wrote: >> I suspect a bug in the previous patches, because this is what the >> forward port looks like. First, for reference, the current results: > > Got it sorted, and pinned sender and receiver on CPUs to avoid the > variation. It looks like this with the task_work approach that I sent > out as v1: > > Latencies for: Sender > percentiles (nsec): > | 1.0000th=[ 2160], 5.0000th=[ 2672], 10.0000th=[ 2768], > | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600], > | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632], > | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696], > | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192], > | 99.9500th=[16320], 99.9900th=[18304] > Latencies for: Receiver > percentiles (nsec): > | 1.0000th=[ 1528], 5.0000th=[ 1576], 10.0000th=[ 1656], > | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064], > | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096], > | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160], > | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168], > | 99.9500th=[15424], 99.9900th=[17280] > > and here's the exact same test run on the current patches: > > Latencies for: Sender > percentiles (nsec): > | 1.0000th=[ 362], 5.0000th=[ 362], 10.0000th=[ 370], > | 20.0000th=[ 370], 30.0000th=[ 370], 40.0000th=[ 370], > | 50.0000th=[ 374], 60.0000th=[ 382], 70.0000th=[ 382], > | 80.0000th=[ 382], 90.0000th=[ 382], 95.0000th=[ 390], > | 99.0000th=[ 402], 99.5000th=[ 430], 99.9000th=[ 900], > | 99.9500th=[ 972], 99.9900th=[ 1432] > Latencies for: Receiver > percentiles (nsec): > | 1.0000th=[ 1528], 5.0000th=[ 1544], 10.0000th=[ 1560], > | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592], > | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608], > | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688], > | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272], > | 99.9500th=[14784], 99.9900th=[73216] > > I'll try and augment the test app to do proper rated submissions, so I > can ramp up the rates a bit and see what happens. And the final one, with the rated sends sorted out. One key observation is that v1 trails the current edition, it just can't keep up as the rate is increased. 
If we cap the rate at at what should be 33K messages per second, v1 gets ~28K messages and has the following latency profile (for a 3 second run) Latencies for: Receiver (msg=83863) percentiles (nsec): | 1.0000th=[ 1208], 5.0000th=[ 1336], 10.0000th=[ 1400], | 20.0000th=[ 1768], 30.0000th=[ 1912], 40.0000th=[ 1976], | 50.0000th=[ 2040], 60.0000th=[ 2160], 70.0000th=[ 2256], | 80.0000th=[ 2480], 90.0000th=[ 2736], 95.0000th=[ 3024], | 99.0000th=[ 4080], 99.5000th=[ 4896], 99.9000th=[ 9664], | 99.9500th=[ 17024], 99.9900th=[218112] Latencies for: Sender (msg=83863) percentiles (nsec): | 1.0000th=[ 1928], 5.0000th=[ 2064], 10.0000th=[ 2160], | 20.0000th=[ 2608], 30.0000th=[ 2672], 40.0000th=[ 2736], | 50.0000th=[ 2864], 60.0000th=[ 2960], 70.0000th=[ 3152], | 80.0000th=[ 3408], 90.0000th=[ 4128], 95.0000th=[ 4576], | 99.0000th=[ 5920], 99.5000th=[ 6752], 99.9000th=[ 13376], | 99.9500th=[ 22912], 99.9900th=[261120] and the current edition does: Latencies for: Sender (msg=94488) percentiles (nsec): | 1.0000th=[ 181], 5.0000th=[ 191], 10.0000th=[ 201], | 20.0000th=[ 215], 30.0000th=[ 225], 40.0000th=[ 235], | 50.0000th=[ 262], 60.0000th=[ 306], 70.0000th=[ 430], | 80.0000th=[ 1004], 90.0000th=[ 2480], 95.0000th=[ 3632], | 99.0000th=[ 8096], 99.5000th=[12352], 99.9000th=[18048], | 99.9500th=[19584], 99.9900th=[23680] Latencies for: Receiver (msg=94488) percentiles (nsec): | 1.0000th=[ 342], 5.0000th=[ 398], 10.0000th=[ 482], | 20.0000th=[ 652], 30.0000th=[ 812], 40.0000th=[ 972], | 50.0000th=[ 1240], 60.0000th=[ 1640], 70.0000th=[ 1944], | 80.0000th=[ 2448], 90.0000th=[ 3248], 95.0000th=[ 5216], | 99.0000th=[10304], 99.5000th=[12352], 99.9000th=[18048], | 99.9500th=[19840], 99.9900th=[23168] If we cap it where v1 keeps up, at 13K messages per second, v1 does: Latencies for: Receiver (msg=38820) percentiles (nsec): | 1.0000th=[ 1160], 5.0000th=[ 1256], 10.0000th=[ 1352], | 20.0000th=[ 1688], 30.0000th=[ 1928], 40.0000th=[ 1976], | 50.0000th=[ 2064], 60.0000th=[ 2384], 70.0000th=[ 2480], | 80.0000th=[ 2768], 90.0000th=[ 3280], 95.0000th=[ 3472], | 99.0000th=[ 4192], 99.5000th=[ 4512], 99.9000th=[ 6624], | 99.9500th=[ 8768], 99.9900th=[14272] Latencies for: Sender (msg=38820) percentiles (nsec): | 1.0000th=[ 1848], 5.0000th=[ 1928], 10.0000th=[ 2040], | 20.0000th=[ 2608], 30.0000th=[ 2640], 40.0000th=[ 2736], | 50.0000th=[ 3024], 60.0000th=[ 3120], 70.0000th=[ 3376], | 80.0000th=[ 3824], 90.0000th=[ 4512], 95.0000th=[ 4768], | 99.0000th=[ 5536], 99.5000th=[ 6048], 99.9000th=[ 9024], | 99.9500th=[10304], 99.9900th=[23424] and v2 does: Latencies for: Sender (msg=39005) percentiles (nsec): | 1.0000th=[ 191], 5.0000th=[ 211], 10.0000th=[ 262], | 20.0000th=[ 342], 30.0000th=[ 382], 40.0000th=[ 402], | 50.0000th=[ 450], 60.0000th=[ 532], 70.0000th=[ 1080], | 80.0000th=[ 1848], 90.0000th=[ 4768], 95.0000th=[10944], | 99.0000th=[16512], 99.5000th=[18304], 99.9000th=[22400], | 99.9500th=[26496], 99.9900th=[41728] Latencies for: Receiver (msg=39005) percentiles (nsec): | 1.0000th=[ 410], 5.0000th=[ 604], 10.0000th=[ 700], | 20.0000th=[ 900], 30.0000th=[ 1128], 40.0000th=[ 1320], | 50.0000th=[ 1672], 60.0000th=[ 2256], 70.0000th=[ 2736], | 80.0000th=[ 3760], 90.0000th=[ 5408], 95.0000th=[11072], | 99.0000th=[18304], 99.5000th=[20096], 99.9000th=[24704], | 99.9500th=[27520], 99.9900th=[35584]
On 5/29/24 02:35, Jens Axboe wrote: > On 5/28/24 5:04 PM, Jens Axboe wrote: >> On 5/28/24 12:31 PM, Jens Axboe wrote: >>> I suspect a bug in the previous patches, because this is what the >>> forward port looks like. First, for reference, the current results: >> >> Got it sorted, and pinned sender and receiver on CPUs to avoid the >> variation. It looks like this with the task_work approach that I sent >> out as v1: >> >> Latencies for: Sender >> percentiles (nsec): >> | 1.0000th=[ 2160], 5.0000th=[ 2672], 10.0000th=[ 2768], >> | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600], >> | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632], >> | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696], >> | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192], >> | 99.9500th=[16320], 99.9900th=[18304] >> Latencies for: Receiver >> percentiles (nsec): >> | 1.0000th=[ 1528], 5.0000th=[ 1576], 10.0000th=[ 1656], >> | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064], >> | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096], >> | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160], >> | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168], >> | 99.9500th=[15424], 99.9900th=[17280] >> >> and here's the exact same test run on the current patches: >> >> Latencies for: Sender >> percentiles (nsec): >> | 1.0000th=[ 362], 5.0000th=[ 362], 10.0000th=[ 370], >> | 20.0000th=[ 370], 30.0000th=[ 370], 40.0000th=[ 370], >> | 50.0000th=[ 374], 60.0000th=[ 382], 70.0000th=[ 382], >> | 80.0000th=[ 382], 90.0000th=[ 382], 95.0000th=[ 390], >> | 99.0000th=[ 402], 99.5000th=[ 430], 99.9000th=[ 900], >> | 99.9500th=[ 972], 99.9900th=[ 1432] >> Latencies for: Receiver >> percentiles (nsec): >> | 1.0000th=[ 1528], 5.0000th=[ 1544], 10.0000th=[ 1560], >> | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592], >> | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608], >> | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688], >> | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272], >> | 99.9500th=[14784], 99.9900th=[73216] >> >> I'll try and augment the test app to do proper rated submissions, so I >> can ramp up the rates a bit and see what happens. > > And the final one, with the rated sends sorted out. One key observation > is that v1 trails the current edition, it just can't keep up as the rate > is increased. If we cap the rate at at what should be 33K messages per > second, v1 gets ~28K messages and has the following latency profile (for > a 3 second run) Do you see where the receiver latency comes from? The wakeups are quite similar in nature, assuming it's all wait(nr=1) and CPUs are not 100% consumed. The hop back spoils scheduling timing? 
> Latencies for: Receiver (msg=83863) > percentiles (nsec): > | 1.0000th=[ 1208], 5.0000th=[ 1336], 10.0000th=[ 1400], > | 20.0000th=[ 1768], 30.0000th=[ 1912], 40.0000th=[ 1976], > | 50.0000th=[ 2040], 60.0000th=[ 2160], 70.0000th=[ 2256], > | 80.0000th=[ 2480], 90.0000th=[ 2736], 95.0000th=[ 3024], > | 99.0000th=[ 4080], 99.5000th=[ 4896], 99.9000th=[ 9664], > | 99.9500th=[ 17024], 99.9900th=[218112] > Latencies for: Sender (msg=83863) > percentiles (nsec): > | 1.0000th=[ 1928], 5.0000th=[ 2064], 10.0000th=[ 2160], > | 20.0000th=[ 2608], 30.0000th=[ 2672], 40.0000th=[ 2736], > | 50.0000th=[ 2864], 60.0000th=[ 2960], 70.0000th=[ 3152], > | 80.0000th=[ 3408], 90.0000th=[ 4128], 95.0000th=[ 4576], > | 99.0000th=[ 5920], 99.5000th=[ 6752], 99.9000th=[ 13376], > | 99.9500th=[ 22912], 99.9900th=[261120] > > and the current edition does: > > Latencies for: Sender (msg=94488) > percentiles (nsec): > | 1.0000th=[ 181], 5.0000th=[ 191], 10.0000th=[ 201], > | 20.0000th=[ 215], 30.0000th=[ 225], 40.0000th=[ 235], > | 50.0000th=[ 262], 60.0000th=[ 306], 70.0000th=[ 430], > | 80.0000th=[ 1004], 90.0000th=[ 2480], 95.0000th=[ 3632], > | 99.0000th=[ 8096], 99.5000th=[12352], 99.9000th=[18048], > | 99.9500th=[19584], 99.9900th=[23680] > Latencies for: Receiver (msg=94488) > percentiles (nsec): > | 1.0000th=[ 342], 5.0000th=[ 398], 10.0000th=[ 482], > | 20.0000th=[ 652], 30.0000th=[ 812], 40.0000th=[ 972], > | 50.0000th=[ 1240], 60.0000th=[ 1640], 70.0000th=[ 1944], > | 80.0000th=[ 2448], 90.0000th=[ 3248], 95.0000th=[ 5216], > | 99.0000th=[10304], 99.5000th=[12352], 99.9000th=[18048], > | 99.9500th=[19840], 99.9900th=[23168] > > If we cap it where v1 keeps up, at 13K messages per second, v1 does: > > Latencies for: Receiver (msg=38820) > percentiles (nsec): > | 1.0000th=[ 1160], 5.0000th=[ 1256], 10.0000th=[ 1352], > | 20.0000th=[ 1688], 30.0000th=[ 1928], 40.0000th=[ 1976], > | 50.0000th=[ 2064], 60.0000th=[ 2384], 70.0000th=[ 2480], > | 80.0000th=[ 2768], 90.0000th=[ 3280], 95.0000th=[ 3472], > | 99.0000th=[ 4192], 99.5000th=[ 4512], 99.9000th=[ 6624], > | 99.9500th=[ 8768], 99.9900th=[14272] > Latencies for: Sender (msg=38820) > percentiles (nsec): > | 1.0000th=[ 1848], 5.0000th=[ 1928], 10.0000th=[ 2040], > | 20.0000th=[ 2608], 30.0000th=[ 2640], 40.0000th=[ 2736], > | 50.0000th=[ 3024], 60.0000th=[ 3120], 70.0000th=[ 3376], > | 80.0000th=[ 3824], 90.0000th=[ 4512], 95.0000th=[ 4768], > | 99.0000th=[ 5536], 99.5000th=[ 6048], 99.9000th=[ 9024], > | 99.9500th=[10304], 99.9900th=[23424] > > and v2 does: > > Latencies for: Sender (msg=39005) > percentiles (nsec): > | 1.0000th=[ 191], 5.0000th=[ 211], 10.0000th=[ 262], > | 20.0000th=[ 342], 30.0000th=[ 382], 40.0000th=[ 402], > | 50.0000th=[ 450], 60.0000th=[ 532], 70.0000th=[ 1080], > | 80.0000th=[ 1848], 90.0000th=[ 4768], 95.0000th=[10944], > | 99.0000th=[16512], 99.5000th=[18304], 99.9000th=[22400], > | 99.9500th=[26496], 99.9900th=[41728] > Latencies for: Receiver (msg=39005) > percentiles (nsec): > | 1.0000th=[ 410], 5.0000th=[ 604], 10.0000th=[ 700], > | 20.0000th=[ 900], 30.0000th=[ 1128], 40.0000th=[ 1320], > | 50.0000th=[ 1672], 60.0000th=[ 2256], 70.0000th=[ 2736], > | 80.0000th=[ 3760], 90.0000th=[ 5408], 95.0000th=[11072], > | 99.0000th=[18304], 99.5000th=[20096], 99.9000th=[24704], > | 99.9500th=[27520], 99.9900th=[35584] >
On 5/28/24 8:08 PM, Pavel Begunkov wrote: > On 5/29/24 02:35, Jens Axboe wrote: >> On 5/28/24 5:04 PM, Jens Axboe wrote: >>> On 5/28/24 12:31 PM, Jens Axboe wrote: >>>> I suspect a bug in the previous patches, because this is what the >>>> forward port looks like. First, for reference, the current results: >>> >>> Got it sorted, and pinned sender and receiver on CPUs to avoid the >>> variation. It looks like this with the task_work approach that I sent >>> out as v1: >>> >>> Latencies for: Sender >>> percentiles (nsec): >>> | 1.0000th=[ 2160], 5.0000th=[ 2672], 10.0000th=[ 2768], >>> | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600], >>> | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632], >>> | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696], >>> | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192], >>> | 99.9500th=[16320], 99.9900th=[18304] >>> Latencies for: Receiver >>> percentiles (nsec): >>> | 1.0000th=[ 1528], 5.0000th=[ 1576], 10.0000th=[ 1656], >>> | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064], >>> | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096], >>> | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160], >>> | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168], >>> | 99.9500th=[15424], 99.9900th=[17280] >>> >>> and here's the exact same test run on the current patches: >>> >>> Latencies for: Sender >>> percentiles (nsec): >>> | 1.0000th=[ 362], 5.0000th=[ 362], 10.0000th=[ 370], >>> | 20.0000th=[ 370], 30.0000th=[ 370], 40.0000th=[ 370], >>> | 50.0000th=[ 374], 60.0000th=[ 382], 70.0000th=[ 382], >>> | 80.0000th=[ 382], 90.0000th=[ 382], 95.0000th=[ 390], >>> | 99.0000th=[ 402], 99.5000th=[ 430], 99.9000th=[ 900], >>> | 99.9500th=[ 972], 99.9900th=[ 1432] >>> Latencies for: Receiver >>> percentiles (nsec): >>> | 1.0000th=[ 1528], 5.0000th=[ 1544], 10.0000th=[ 1560], >>> | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592], >>> | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608], >>> | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688], >>> | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272], >>> | 99.9500th=[14784], 99.9900th=[73216] >>> >>> I'll try and augment the test app to do proper rated submissions, so I >>> can ramp up the rates a bit and see what happens. >> >> And the final one, with the rated sends sorted out. One key observation >> is that v1 trails the current edition, it just can't keep up as the rate >> is increased. If we cap the rate at at what should be 33K messages per >> second, v1 gets ~28K messages and has the following latency profile (for >> a 3 second run) > > Do you see where the receiver latency comes from? The wakeups are > quite similar in nature, assuming it's all wait(nr=1) and CPUs > are not 100% consumed. The hop back spoils scheduling timing? I haven't dug into that side yet, but I'm guessing it's indeed scheduling artifacts. It's all doing single waits, the sender is doing io_uring_submit_and_wait(ring, 1) and the receiver is doing io_uring_wait_cqe(ring, &cqe);
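For reference, a minimal sketch of the per-message single-wait pattern described above (the actual msg-lat tool also timestamps each message to produce the percentiles shown earlier; the helper names here are made up):

#include <liburing.h>
#include <stdint.h>

/* sender: post to the other ring and wait for our own CQE, i.e. the
 * io_uring_submit_and_wait(ring, 1) pattern mentioned above */
static int send_one(struct io_uring *ring, int other_ring_fd, uint64_t seq)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    struct io_uring_cqe *cqe;
    int ret;

    if (!sqe)
        return -1;
    io_uring_prep_msg_ring(sqe, other_ring_fd, 0, seq, 0);
    ret = io_uring_submit_and_wait(ring, 1);
    if (ret < 0)
        return ret;
    ret = io_uring_wait_cqe(ring, &cqe);   /* sender latency ends here */
    if (ret == 0)
        io_uring_cqe_seen(ring, cqe);
    return ret;
}

/* receiver: a single io_uring_wait_cqe() per message, as mentioned above */
static int recv_one(struct io_uring *ring, uint64_t *seq)
{
    struct io_uring_cqe *cqe;
    int ret;

    ret = io_uring_wait_cqe(ring, &cqe);   /* receiver latency ends here */
    if (ret < 0)
        return ret;
    *seq = cqe->user_data;
    io_uring_cqe_seen(ring, cqe);
    return 0;
}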