
[PATCHSET 0/3] Improve MSG_RING SINGLE_ISSUER performance

Message ID: 20240524230501.20178-1-axboe@kernel.dk

Message

Jens Axboe May 24, 2024, 10:58 p.m. UTC
Hi,

A ring setup with IORING_SETUP_SINGLE_ISSUER, which is required to
use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
generic task_work. This isn't ideal. This patchset attempts to rectify
that, taking a new approach rather than trying to use the io_uring
task_work infrastructure to handle it as in previous postings.

In a sample test app that has one thread send messages to another,
logging both the time between sender sending and receiver receiving and
just the time for the sender to post a message and get the CQE back,
I see the following sender latencies with the stock kernel:

Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[ 4384],  5.0000th=[ 4512], 10.0000th=[ 4576],
     | 20.0000th=[ 4768], 30.0000th=[ 4896], 40.0000th=[ 5024],
     | 50.0000th=[ 5088], 60.0000th=[ 5152], 70.0000th=[ 5280],
     | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
     | 99.0000th=[ 8032], 99.5000th=[18048], 99.9000th=[21376],
     | 99.9500th=[26496], 99.9900th=[59136]

and with the patches:

Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[  756],  5.0000th=[  820], 10.0000th=[  828],
     | 20.0000th=[  844], 30.0000th=[  852], 40.0000th=[  852],
     | 50.0000th=[  860], 60.0000th=[  860], 70.0000th=[  868],
     | 80.0000th=[  884], 90.0000th=[  964], 95.0000th=[  988],
     | 99.0000th=[ 1128], 99.5000th=[ 1208], 99.9000th=[ 1544],
     | 99.9500th=[ 1944], 99.9900th=[ 2896]

For the receiving side the win is smaller as it only "suffers" from
a single generic task_work, about a 10% win in latencies there.

The idea here is to utilize the CQE overflow infrastructure for this,
as that allows the right task to post the CQE to the ring.

Patch 1 is a basic refactoring prep patch, patch 2 adds support for normal
messages, and patch 3 adopts the same approach for fd passing.

 io_uring/msg_ring.c | 151 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 138 insertions(+), 13 deletions(-)
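
For reference, a rough sketch of this kind of test using liburing; the
actual msg-lat app isn't included here, so the layout and names below are
illustrative only. Each thread owns a SINGLE_ISSUER|DEFER_TASKRUN ring,
and the sender times a single MSG_RING from submit until it sees its own
CQE:

#include <liburing.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static int recv_ring_fd = -1;

static unsigned long long nsecs(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static void *receiver(void *arg)
{
	struct io_uring_params p = { .flags = IORING_SETUP_SINGLE_ISSUER |
					      IORING_SETUP_DEFER_TASKRUN };
	struct io_uring ring;
	struct io_uring_cqe *cqe;

	(void) arg;
	io_uring_queue_init_params(8, &ring, &p);
	__atomic_store_n(&recv_ring_fd, ring.ring_fd, __ATOMIC_RELEASE);

	/* the CQE posted by the sender's MSG_RING carries the send
	 * timestamp as user_data, so the delta is the receiver delay */
	io_uring_wait_cqe(&ring, &cqe);
	printf("receiver delay: %llu nsec\n",
	       nsecs() - (unsigned long long) cqe->user_data);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return NULL;
}

int main(void)
{
	struct io_uring_params p = { .flags = IORING_SETUP_SINGLE_ISSUER |
					      IORING_SETUP_DEFER_TASKRUN };
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	unsigned long long t0;
	pthread_t thr;
	int fd;

	pthread_create(&thr, NULL, receiver, NULL);
	while ((fd = __atomic_load_n(&recv_ring_fd, __ATOMIC_ACQUIRE)) == -1)
		;

	io_uring_queue_init_params(8, &ring, &p);
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_msg_ring(sqe, fd, 0, nsecs(), 0);

	/* sender latency: from submit until our own CQE comes back */
	t0 = nsecs();
	io_uring_submit_and_wait(&ring, 1);
	io_uring_wait_cqe(&ring, &cqe);
	printf("sender latency: %llu nsec\n", nsecs() - t0);
	io_uring_cqe_seen(&ring, cqe);

	pthread_join(thr, NULL);
	io_uring_queue_exit(&ring);
	return 0;
}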

Comments

Pavel Begunkov May 28, 2024, 1:31 p.m. UTC | #1
On 5/24/24 23:58, Jens Axboe wrote:
> Hi,
> 
> A ring setup with IORING_SETUP_SINGLE_ISSUER, which is required to

IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's
specifically an IORING_SETUP_DEFER_TASKRUN optimisation.

> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
> generic task_work. This isn't ideal. This patchset attempts to rectify
> that, taking a new approach rather than trying to use the io_uring
> task_work infrastructure to handle it as in previous postings.

Not sure why you'd want to piggyback onto overflows, it's not
such a well made and reliable infra, whereas the DEFER_TASKRUN
part of the task_work approach was fine.

The completion path doesn't usually look at the overflow list
but on cached cqe pointers showing the CQ is full, that means
after you queue an overflow someone may post a CQE in the CQ
in the normal path and you get reordering. Not that bad
considering it's from another ring, but a bit nasty and surely
will bite us back in the future, it always does.

That's assuming you decide io_msg_need_remote() solely based on
->task_complete and remove

	return current != target_ctx->submitter_task;

otherwise you can get two linked msg_ring target CQEs reordered.

It's also duplicating that crappy overflow code nobody cares
much about, and it's still a forced wake up with no batching,
circumventing the normal wake up path, i.e. io_uring tw.

I don't think we should care about the request completion
latency (sender latency), people should be more interested
in the reaction speed (receiver latency), but if you care
about it for a reason, perhaps you can just as well allocate
a new request and complete the previous one right away.

> In a sample test app that has one thread send messages to another,
> logging both the time between sender sending and receiver receiving and
> just the time for the sender to post a message and get the CQE back,
> I see the following sender latencies with the stock kernel:
> 
> Latencies for: Sender
>      percentiles (nsec):
>       |  1.0000th=[ 4384],  5.0000th=[ 4512], 10.0000th=[ 4576],
>       | 20.0000th=[ 4768], 30.0000th=[ 4896], 40.0000th=[ 5024],
>       | 50.0000th=[ 5088], 60.0000th=[ 5152], 70.0000th=[ 5280],
>       | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
>       | 99.0000th=[ 8032], 99.5000th=[18048], 99.9000th=[21376],
>       | 99.9500th=[26496], 99.9900th=[59136]
> 
> and with the patches:
> 
> Latencies for: Sender
>      percentiles (nsec):
>       |  1.0000th=[  756],  5.0000th=[  820], 10.0000th=[  828],
>       | 20.0000th=[  844], 30.0000th=[  852], 40.0000th=[  852],
>       | 50.0000th=[  860], 60.0000th=[  860], 70.0000th=[  868],
>       | 80.0000th=[  884], 90.0000th=[  964], 95.0000th=[  988],
>       | 99.0000th=[ 1128], 99.5000th=[ 1208], 99.9000th=[ 1544],
>       | 99.9500th=[ 1944], 99.9900th=[ 2896]
> 
> For the receiving side the win is smaller as it only "suffers" from
> a single generic task_work, about a 10% win in latencies there.
> 
> The idea here is to utilize the CQE overflow infrastructure for this,
> as that allows the right task to post the CQE to the ring.
> 
> Patch 1 is a basic refactoring prep patch, patch 2 adds support for normal
> messages, and patch 3 adopts the same approach for fd passing.
> 
>   io_uring/msg_ring.c | 151 ++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 138 insertions(+), 13 deletions(-)
>
Jens Axboe May 28, 2024, 2:34 p.m. UTC | #2
On 5/28/24 7:31 AM, Pavel Begunkov wrote:
> On 5/24/24 23:58, Jens Axboe wrote:
>> Hi,
>>
>> A ring setup with IORING_SETUP_SINGLE_ISSUER, which is required to
> 
> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's
> specifically an IORING_SETUP_DEFER_TASKRUN optimisation.

Right, I should change that in the commit message. It's task_complete
driving it, which is tied to DEFER_TASKRUN.

>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
>> generic task_work. This isn't ideal. This patchset attempts to rectify
>> that, taking a new approach rather than trying to use the io_uring
>> task_work infrastructure to handle it as in previous postings.
> 
> Not sure why you'd want to piggyback onto overflows, it's not
> such a well made and reliable infra, whereas the DEFER_TASKRUN
> part of the task_work approach was fine.

It's not right now, because it's really a "don't get into this
condition, if you do, things are slower". And this series doesn't really
change that, and honestly it doesn't even need to. It's still way better
than what we had before in terms of DEFER_TASKRUN and messages.

We could certainly make the messages a subset of real overflow if we
wanted, but honestly it looks decent enough as-is with the changes that
I'm not hugely motivated to do that.

> The completion path doesn't usually look at the overflow list
> but on cached cqe pointers showing the CQ is full, that means
> after you queue an overflow someone may post a CQE in the CQ
> in the normal path and you get reordering. Not that bad
> considering it's from another ring, but a bit nasty and surely
> will bite us back in the future, it always does.

This is true, but it's generally true anyway, as completions come in async.
You don't get to control the exact order on a remote ring. Messages
themselves will be ordered when posted, which should be the important
aspect here. Order with locally posted completions I don't see why you'd
care, that's a timing game that you cannot control.

> That's assuming you decide io_msg_need_remote() solely based on
> ->task_complete and remove
> 
>     return current != target_ctx->submitter_task;
> 
> otherwise you can get two linked msg_ring target CQEs reordered.

Good point, since it'd now be cheap enough, would probably make sense to
simply gate it on task_complete alone. I even toyed with the idea of
just using this approach for any ring type and kill some code in there,
but didn't venture that far yet.
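
Concretely, gating it on task_complete alone would look something like
this (a sketch only, not the actual patch):

/* sketch: route through the remote posting path whenever the target
 * ring completes CQEs only from its submitter task (DEFER_TASKRUN),
 * regardless of which task happens to be issuing the msg_ring.
 */
static inline bool io_msg_need_remote(struct io_ring_ctx *target_ctx)
{
	return target_ctx->task_complete;
}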

> It's also duplicating that crappy overflow code nobody cares
> much about, and it's still a forced wake up with no batching,
> circumventing the normal wake up path, i.e. io_uring tw.

Yes, since this is v1 I didn't bother to integrate more tightly with the
generic overflows, that should obviously be done by first adding a
helper for this. I consider that pretty minor.

> I don't think we should care about the request completion
> latency (sender latency), people should be more interested
> in the reaction speed (receiver latency), but if you care
> about it for a reason, perhaps you can just as well allocate
> a new request and complete the previous one right away.

I know the numbers I posted were just sender side improvements on that
particular box, however that isn't really the case for others. Here's
another test box:

axboe@r7525 ~> ./msg-lat
init_flags=3000
Wait on startup
802775: my fd 3, other 4
802776: my fd 4, other 3
Latencies for: Receiver
    percentiles (nsec):
     |  1.0000th=[ 4192],  5.0000th=[ 4320], 10.0000th=[ 4448],
     | 20.0000th=[ 4576], 30.0000th=[ 4704], 40.0000th=[ 4832],
     | 50.0000th=[ 4960], 60.0000th=[ 5088], 70.0000th=[ 5216],
     | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
     | 99.0000th=[ 6176], 99.5000th=[ 7136], 99.9000th=[19584],
     | 99.9500th=[20352], 99.9900th=[28288]
Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[ 6560],  5.0000th=[ 6880], 10.0000th=[ 7008],
     | 20.0000th=[ 7264], 30.0000th=[ 7456], 40.0000th=[ 7712],
     | 50.0000th=[ 8032], 60.0000th=[ 8256], 70.0000th=[ 8512],
     | 80.0000th=[ 8640], 90.0000th=[ 8896], 95.0000th=[ 9152],
     | 99.0000th=[ 9792], 99.5000th=[11584], 99.9000th=[23168],
     | 99.9500th=[28032], 99.9900th=[40192]

and after:

axboe@r7525 ~> ./msg-lat                                                       1.776s
init_flags=3000
Wait on startup
3767: my fd 3, other 4
3768: my fd 4, other 3
Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[  740],  5.0000th=[  748], 10.0000th=[  756],
     | 20.0000th=[  764], 30.0000th=[  764], 40.0000th=[  772],
     | 50.0000th=[  772], 60.0000th=[  780], 70.0000th=[  780],
     | 80.0000th=[  860], 90.0000th=[  892], 95.0000th=[  900],
     | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656],
     | 99.9500th=[ 1976], 99.9900th=[ 3408]
Latencies for: Receiver
    percentiles (nsec):
     |  1.0000th=[ 2736],  5.0000th=[ 2736], 10.0000th=[ 2768],
     | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800],
     | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896],
     | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120],
     | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560],
     | 99.9500th=[21632], 99.9900th=[58624]

Obviously some variation in runs in general, but it's most certainly
faster in terms of receiving too. This test case is fixed at doing 100K
messages per second, didn't do any peak testing. But I strongly suspect
you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL
task_work is pretty sucky imho.

You could just make it io_kiocb based, but I did not want to get into
foreign requests on remote rings. What would you envision with that
approach, using our normal ring task_work for this instead? That would
be an approach, obviously this one took a different path from the
previous task_work driven approach.
Jens Axboe May 28, 2024, 2:39 p.m. UTC | #3
On 5/28/24 8:34 AM, Jens Axboe wrote:
> You could just make it io_kiocb based, but I did not want to get into
> foreign requests on remote rings. What would you envision with that
> approach, using our normal ring task_work for this instead? That would
> be an approach, obviously this one took a different path from the
> previous task_work driven approach.

Foreign io_kiocb itself isn't even enough, you'd need an overflow cqe
allocated upfront too for posting directly without waiting for the
reply. Basically it needs to be confident it'll post successfully on the
remote ring, otherwise we cannot complete the source msg ring sqe
inline.
Jens Axboe May 28, 2024, 3:27 p.m. UTC | #4
On 5/28/24 8:34 AM, Jens Axboe wrote:
> Obviously some variation in runs in general, but it's most certainly
> faster in terms of receiving too. This test case is fixed at doing 100K
> messages per second, didn't do any peak testing. But I strongly suspect
> you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL
> task_work is pretty sucky imho.

Example on that same box. Running the same message passer test, sys time
used for a 3 second run varies between 300ms and 850ms on the stock
kernel (with a lot of variance); after the patches it's around 150ms and
pretty steady.
Pavel Begunkov May 28, 2024, 4:50 p.m. UTC | #5
On 5/28/24 15:34, Jens Axboe wrote:
> On 5/28/24 7:31 AM, Pavel Begunkov wrote:
>> On 5/24/24 23:58, Jens Axboe wrote:
>>> Hi,
>>>
>>> A ring setup with IORING_SETUP_SINGLE_ISSUER, which is required to
>>
>> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's
>> specifically an IORING_SETUP_DEFER_TASKRUN optimisation.
> 
> Right, I should change that in the commit message. It's task_complete
> driving it, which is tied to DEFER_TASKRUN.
> 
>>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
>>> generic task_work. This isn't ideal. This patchset attempts to rectify
>>> that, taking a new approach rather than trying to use the io_uring
>>> task_work infrastructure to handle it as in previous postings.
>>
>> Not sure why you'd want to piggyback onto overflows, it's not
>> such a well made and reliable infra, whereas the DEFER_TASKRUN
>> part of the task_work approach was fine.
> 
> It's not right now, because it's really a "don't get into this
> condition, if you do, things are slower". And this series doesn't really
> change that, and honestly it doesn't even need to. It's still way better
> than what we had before in terms of DEFER_TASKRUN and messages.

Better than how it is now or compared to the previous attempt?
I think the one using io_uring's tw infra was better, which is
where all wake ups and optimisations currently consolidate.

> We could certainly make the messages a subset of real overflow if we
> wanted, but honestly it looks decent enough as-is with the changes that
> I'm not hugely motivated to do that.

Not sure what you mean here, but not really suggesting refactoring
overflows. Taking the tw based patches should be easy, it only
needs killing !DEFER_TASKRUN changes from there.


>> The completion path doesn't usually look at the overflow list
>> but on cached cqe pointers showing the CQ is full, that means
>> after you queue an overflow someone may post a CQE in the CQ
>> in the normal path and you get reordering. Not that bad
>> considering it's from another ring, but a bit nasty and surely
>> will bite us back in the future, it always does.
> 
> This is true, but it's generally true anyway, as completions come in async.
> You don't get to control the exact order on a remote ring. Messages
> themselves will be ordered when posted, which should be the important
> aspect here. Order with locally posted completions I don't see why you'd
> care, that's a timing game that you cannot control.

True for a single request, but in a more complex system
sender's ordering will affect the order on the receiving side.

ring1: msg_ring(); write(pipe)
ring2: read(pipe)

The user would definitely think that the other ring will
first get a msg_ring CQE and then a CQE from the read, but as
always it's hard to predict all repercussions.
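
In liburing terms the example is roughly the following fragment (ring
setup as in the cover letter test; ring2_fd, pipefd, buf and the tags are
made up for illustration):

/* ring1 (sender): message ring2, then write to a pipe ring2 is reading */
sqe = io_uring_get_sqe(&ring1);
io_uring_prep_msg_ring(sqe, ring2_fd, 0, MSG_TAG, 0);
io_uring_submit(&ring1);
write(pipefd[1], "x", 1);

/* ring2 (receiver): the read was armed before any of the above */
sqe = io_uring_get_sqe(&ring2);
io_uring_prep_read(sqe, pipefd[0], buf, 1, 0);
io_uring_sqe_set_data64(sqe, READ_TAG);
io_uring_submit(&ring2);
/* ... */
io_uring_wait_cqe(&ring2, &cqe);
/* with the msg CQE parked on the overflow list, the read CQE can be
 * posted to the CQ first, so cqe may carry READ_TAG rather than MSG_TAG */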

>> That's assuming you decide io_msg_need_remote() solely based on
>> ->task_complete and remove
>>
>>      return current != target_ctx->submitter_task;
>>
>> otherwise you can get two linked msg_ring target CQEs reordered.
> 
> Good point, since it'd now be cheap enough, would probably make sense to
> simply gate it on task_complete alone. I even toyed with the idea of
> just using this approach for any ring type and kill some code in there,
> but didn't venture that far yet.

That task check doesn't make any real difference. If it's the
same thread you can skip io_uring altogether.

>> It's also duplicating that crappy overflow code nobody cares
>> much about, and it's still a forced wake up with no batching,
>> circumventing the normal wake up path, i.e. io_uring tw.
> 
> Yes, since this is v1 I didn't bother to integrate more tightly with the
> generic overflows, that should obviously be done by first adding a
> helper for this. I consider that pretty minor.

My problem is not duplication of code base but rather
extending the internal user base of it. You can say that
msg_ring can easily become a hot path for some, and
then we'll be putting efforts both into overflows and
task_work when in essence they solve quite a similar
problem here.

>> I don't think we should care about the request completion
>> latency (sender latency), people should be more interested
>> in the reaction speed (receiver latency), but if you care
>> about it for a reason, perhaps you can just as well allocate
>> a new request and complete the previous one right away.
> 
> I know the numbers I posted were just sender side improvements on that
> particular box, however that isn't really the case for others. Here's
> another test box:
> 
> axboe@r7525 ~> ./msg-lat
> init_flags=3000
> Wait on startup
> 802775: my fd 3, other 4
> 802776: my fd 4, other 3
> Latencies for: Receiver
>      percentiles (nsec):
>       |  1.0000th=[ 4192],  5.0000th=[ 4320], 10.0000th=[ 4448],
>       | 20.0000th=[ 4576], 30.0000th=[ 4704], 40.0000th=[ 4832],
>       | 50.0000th=[ 4960], 60.0000th=[ 5088], 70.0000th=[ 5216],
>       | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
>       | 99.0000th=[ 6176], 99.5000th=[ 7136], 99.9000th=[19584],
>       | 99.9500th=[20352], 99.9900th=[28288]
> Latencies for: Sender
>      percentiles (nsec):
>       |  1.0000th=[ 6560],  5.0000th=[ 6880], 10.0000th=[ 7008],
>       | 20.0000th=[ 7264], 30.0000th=[ 7456], 40.0000th=[ 7712],
>       | 50.0000th=[ 8032], 60.0000th=[ 8256], 70.0000th=[ 8512],
>       | 80.0000th=[ 8640], 90.0000th=[ 8896], 95.0000th=[ 9152],
>       | 99.0000th=[ 9792], 99.5000th=[11584], 99.9000th=[23168],
>       | 99.9500th=[28032], 99.9900th=[40192]
> 
> and after:
> 
> axboe@r7525 ~> ./msg-lat                                                       1.776s
> init_flags=3000
> Wait on startup
> 3767: my fd 3, other 4
> 3768: my fd 4, other 3
> Latencies for: Sender
>      percentiles (nsec):
>       |  1.0000th=[  740],  5.0000th=[  748], 10.0000th=[  756],
>       | 20.0000th=[  764], 30.0000th=[  764], 40.0000th=[  772],
>       | 50.0000th=[  772], 60.0000th=[  780], 70.0000th=[  780],
>       | 80.0000th=[  860], 90.0000th=[  892], 95.0000th=[  900],
>       | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656],
>       | 99.9500th=[ 1976], 99.9900th=[ 3408]
> Latencies for: Receiver
>      percentiles (nsec):
>       |  1.0000th=[ 2736],  5.0000th=[ 2736], 10.0000th=[ 2768],
>       | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800],
>       | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896],
>       | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120],
>       | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560],
>       | 99.9500th=[21632], 99.9900th=[58624]
> 
> Obviously some variation in runs in general, but it's most certainly
> faster in terms of receiving too. This test case is fixed at doing 100K
> messages per second, didn't do any peak testing. But I strongly suspect
> you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL
> task_work is pretty sucky imho.

Sure, you mentioned wins on the receiver side, I consider it
to be the main merit (latency and throughput)

> You could just make it io_kiocb based, but I did not want to get into
> foreign requests on remote rings. What would you envision with that
> approach, using our normal ring task_work for this instead? That would

It was buggy in the !DEFER_TASKRUN path. Fortunately, you don't care
about it because it just does it all under ->completion_lock, which
is why you shouldn't have ever hit the problem in testing.

> be an approach, obviously this one took a different path from the
> previous task_work driven approach.
Jens Axboe May 28, 2024, 6:07 p.m. UTC | #6
On 5/28/24 10:50 AM, Pavel Begunkov wrote:
> On 5/28/24 15:34, Jens Axboe wrote:
>> On 5/28/24 7:31 AM, Pavel Begunkov wrote:
>>> On 5/24/24 23:58, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> A ring setup with IORING_SETUP_SINGLE_ISSUER, which is required to
>>>
>>> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's
>>> specifically an IORING_SETUP_DEFER_TASKRUN optimisation.
>>
>> Right, I should change that in the commit message. It's task_complete
>> driving it, which is tied to DEFER_TASKRUN.
>>
>>>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
>>>> generic task_work. This isn't ideal. This patchset attempts to rectify
>>>> that, taking a new approach rather than trying to use the io_uring
>>>> task_work infrastructure to handle it as in previous postings.
>>>
>>> Not sure why you'd want to piggyback onto overflows, it's not
>>> such a well made and reliable infra, whereas the DEFER_TASKRUN
>>> part of the task_work approach was fine.
>>
>> It's not right now, because it's really a "don't get into this
>> condition, if you do, things are slower". And this series doesn't really
>> change that, and honestly it doesn't even need to. It's still way better
>> than what we had before in terms of DEFER_TASKRUN and messages.
> 
> Better than how it is now or compared to the previous attempt?
> I think the one using io_uring's tw infra was better, which is
> where all wake ups and optimisations currently consolidate.

Better than both - I haven't tested with the previous version, but I can
certainly do that. The reason why I think it'll be better is that it
avoids the double roundtrips. Yes v1 was using normal task_work which is
better, but it didn't solve what I think is the fundamental issue here.

I'll forward port it and give it a spin, then we'll know.

>> We could certainly make the messages a subset of real overflow if we
>> wanted, but honestly it looks decent enough as-is with the changes that
>> I'm not hugely motivated to do that.
> 
> Not sure what you mean here, but not really suggesting refactoring
> overflows. Taking the tw based patches should be easy, it only
> needs killing !DEFER_TASKRUN changes from there.

Sorry wasn't clear, the refactoring was purely my suggestion to reduce a
bit of the code duplication.

>>> The completion path doesn't usually look at the overflow list
>>> but on cached cqe pointers showing the CQ is full, that means
>>> after you queue an overflow someone may post a CQE in the CQ
>>> in the normal path and you get reordering. Not that bad
>>> considering it's from another ring, but a bit nasty and surely
>>> will bite us back in the future, it always does.
>>
>> This is true, but it's generally true anyway, as completions come in async.
>> You don't get to control the exact order on a remote ring. Messages
>> themselves will be ordered when posted, which should be the important
>> aspect here. Order with locally posted completions I don't see why you'd
>> care, that's a timing game that you cannot control.
> 
> True for a single request, but in a more complex system
> sender's ordering will affect the order on the receiving side.
> 
> ring1: msg_ring(); write(pipe)
> ring2: read(pipe)
> 
> The user would definitely think that the other ring will
> first get a msg_ring CQE and then a CQE from the read, but as
> always it's hard to predict all repercussions.

Nobody should be making assumptions like that, and that will in fact
already not be the case. If the msg_ring fails to lock the remote ring,
then it may very well end up in the hands of io-wq. And then you could
get either result validly, msg CQE first or last.

>>> That's assuming you decide io_msg_need_remote() solely based on
>>> ->task_complete and remove
>>>
>>>      return current != target_ctx->submitter_task;
>>>
>>> otherwise you can get two linked msg_ring target CQEs reordered.
>>
>> Good point, since it'd now be cheap enough, would probably make sense to
>> simply gate it on task_complete alone. I even toyed with the idea of
>> just using this approach for any ring type and kill some code in there,
>> but didn't venture that far yet.
> 
> That task check doesn't make any real difference. If it's the
> same thread you can skip io_uring altogether.

Yeah agree, I did add a patch in between after the last email to just
remove the task check. It's not really a useful case to attempt to cater
to in particular.

>>> It's also duplicating that crappy overflow code nobody cares
>>> much about, and it's still a forced wake up with no batching,
>>> circumventing the normal wake up path, i.e. io_uring tw.
>>
>> Yes, since this is v1 I didn't bother to integrate more tightly with the
>> generic overflows, that should obviously be done by first adding a
>> helper for this. I consider that pretty minor.
> 
> My problem is not duplication of code base but rather
> extending the internal user base of it. You can say that
> msg_ring can easily become a hot path for some, and
> then we'll be putting efforts both into overflows and
> task_work when in essence they solve quite a similar
> problem here.

That's why I was tempted to remove the task_work path for messages on
top of this. But I agree on the wakeup side, that's obviously something
that doesn't currently work with msg ring. And while I don't see that as
a hard problem to solve, it is a bit annoying to have multiple sources
for that. Would naturally be better to retain just the task_work side.

One use case that I think is interesting with messages is work passing,
eliminating a user side data structure (and lock) for that and side
channel wakeups. I don't think the wakeup batching matters, as you're
generally not going to be batching receiving work. You're either already
running work for processing, or sleeping waiting for one.
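
As a sketch of that work passing pattern (struct work and the helper
names here are made up; only the msg_ring usage is the point):

#include <liburing.h>
#include <stdint.h>

struct work {
	void (*fn)(struct work *);
	/* ... */
};

/* producer: pass a work item pointer straight to the consumer's ring,
 * the MSG_RING CQE doubles as the wakeup */
static void send_work(struct io_uring *ring, int target_ring_fd,
		      struct work *w)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_msg_ring(sqe, target_ring_fd, 0, (__u64)(uintptr_t)w, 0);
	io_uring_submit(ring);
}

/* consumer: no shared queue, lock or eventfd, just wait on the ring */
static void run_one(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	struct work *w;

	io_uring_wait_cqe(ring, &cqe);
	w = (struct work *)(uintptr_t)cqe->user_data;
	io_uring_cqe_seen(ring, cqe);
	w->fn(w);
}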

>>> I don't think we should care about the request completion
>>> latency (sender latency), people should be more interested
>>> in the reaction speed (receiver latency), but if you care
>>> about it for a reason, perhaps you can just as well allocate
>>> a new request and complete the previous one right away.
>>
>> I know the numbers I posted were just sender side improvements on that
>> particular box, however that isn't really the case for others. Here's
>> another test box:
>>
>> axboe@r7525 ~> ./msg-lat
>> init_flags=3000
>> Wait on startup
>> 802775: my fd 3, other 4
>> 802776: my fd 4, other 3
>> Latencies for: Receiver
>>      percentiles (nsec):
>>       |  1.0000th=[ 4192],  5.0000th=[ 4320], 10.0000th=[ 4448],
>>       | 20.0000th=[ 4576], 30.0000th=[ 4704], 40.0000th=[ 4832],
>>       | 50.0000th=[ 4960], 60.0000th=[ 5088], 70.0000th=[ 5216],
>>       | 80.0000th=[ 5344], 90.0000th=[ 5536], 95.0000th=[ 5728],
>>       | 99.0000th=[ 6176], 99.5000th=[ 7136], 99.9000th=[19584],
>>       | 99.9500th=[20352], 99.9900th=[28288]
>> Latencies for: Sender
>>      percentiles (nsec):
>>       |  1.0000th=[ 6560],  5.0000th=[ 6880], 10.0000th=[ 7008],
>>       | 20.0000th=[ 7264], 30.0000th=[ 7456], 40.0000th=[ 7712],
>>       | 50.0000th=[ 8032], 60.0000th=[ 8256], 70.0000th=[ 8512],
>>       | 80.0000th=[ 8640], 90.0000th=[ 8896], 95.0000th=[ 9152],
>>       | 99.0000th=[ 9792], 99.5000th=[11584], 99.9000th=[23168],
>>       | 99.9500th=[28032], 99.9900th=[40192]
>>
>> and after:
>>
>> axboe@r7525 ~> ./msg-lat                                                       1.776s
>> init_flags=3000
>> Wait on startup
>> 3767: my fd 3, other 4
>> 3768: my fd 4, other 3
>> Latencies for: Sender
>>      percentiles (nsec):
>>       |  1.0000th=[  740],  5.0000th=[  748], 10.0000th=[  756],
>>       | 20.0000th=[  764], 30.0000th=[  764], 40.0000th=[  772],
>>       | 50.0000th=[  772], 60.0000th=[  780], 70.0000th=[  780],
>>       | 80.0000th=[  860], 90.0000th=[  892], 95.0000th=[  900],
>>       | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656],
>>       | 99.9500th=[ 1976], 99.9900th=[ 3408]
>> Latencies for: Receiver
>>      percentiles (nsec):
>>       |  1.0000th=[ 2736],  5.0000th=[ 2736], 10.0000th=[ 2768],
>>       | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800],
>>       | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896],
>>       | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120],
>>       | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560],
>>       | 99.9500th=[21632], 99.9900th=[58624]
>>
>> Obviously some variation in runs in general, but it's most certainly
>> faster in terms of receiving too. This test case is fixed at doing 100K
>> messages per second, didn't do any peak testing. But I strongly suspect
>> you'll see very nice efficiency gains here too, as doing two TWA_SIGNAL
>> task_work is pretty sucky imho.
> 
> Sure, you mentioned wins on the receiver side, I consider it
> to be the main merit (latency and throughput)

Actually I think both are interesting - not because the sender side is
latency sensitive on receiving the CQE for it, but because it goes
hand in hand with a reduction in cycles spent sending that work. That's
the win on the sender side, more so than the latency win. The latter is
just gravy on top.

>> You could just make it io_kiocb based, but I did not want to get into
>> foreign requests on remote rings. What would you envision with that
>> approach, using our normal ring task_work for this instead? That would
> 
> It was buggy in the !DEFER_TASKRUN path. Fortunately, you don't care
> about it because it just does it all under ->completion_lock, which
> is why you shouldn't have ever hit the problem in testing.

I'll test the old approach and we'll see where we are at.
Jens Axboe May 28, 2024, 6:31 p.m. UTC | #7
On 5/28/24 12:07 PM, Jens Axboe wrote:
> On 5/28/24 10:50 AM, Pavel Begunkov wrote:
>> On 5/28/24 15:34, Jens Axboe wrote:
>>> On 5/28/24 7:31 AM, Pavel Begunkov wrote:
>>>> On 5/24/24 23:58, Jens Axboe wrote:
>>>>> Hi,
>>>>>
>>>>> A ring setup with IORING_SETUP_SINGLE_ISSUER, which is required to
>>>>
>>>> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's
>>>> specifically an IORING_SETUP_DEFER_TASKRUN optimisation.
>>>
>>> Right, I should change that in the commit message. It's task_complete
>>> driving it, which is tied to DEFER_TASKRUN.
>>>
>>>>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through
>>>>> generic task_work. This isn't ideal. This patchset attempts to rectify
>>>>> that, taking a new approach rather than trying to use the io_uring
>>>>> task_work infrastructure to handle it as in previous postings.
>>>>
>>>> Not sure why you'd want to piggyback onto overflows, it's not
>>>> such a well made and reliable infra, whereas the DEFER_TASKRUN
>>>> part of the task_work approach was fine.
>>>
>>> It's not right now, because it's really a "don't get into this
>>> condition, if you do, things are slower". And this series doesn't really
>>> change that, and honestly it doesn't even need to. It's still way better
>>> than what we had before in terms of DEFER_TASKRUN and messages.
>>
>> Better than how it is now or compared to the previous attempt?
>> I think the one using io_uring's tw infra was better, which is
>> where all wake ups and optimisations currently consolidate.
> 
> Better than both - I haven't tested with the previous version, but I can
> certainly do that. The reason why I think it'll be better is that it
> avoids the double roundtrips. Yes v1 was using normal task_work which is
> better, but it didn't solve what I think is the fundamental issue here.
> 
> I'll forward port it and give it a spin, then we'll know.

I suspect a bug in the previous patches, because this is what the
forward port looks like. First, for reference, the current results:

init_flags=3000
Wait on startup
3767: my fd 3, other 4
3768: my fd 4, other 3
Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[  740],  5.0000th=[  748], 10.0000th=[  756],
     | 20.0000th=[  764], 30.0000th=[  764], 40.0000th=[  772],
     | 50.0000th=[  772], 60.0000th=[  780], 70.0000th=[  780],
     | 80.0000th=[  860], 90.0000th=[  892], 95.0000th=[  900],
     | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656],
     | 99.9500th=[ 1976], 99.9900th=[ 3408]
Latencies for: Receiver
    percentiles (nsec):
     |  1.0000th=[ 2736],  5.0000th=[ 2736], 10.0000th=[ 2768],
     | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800],
     | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896],
     | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120],
     | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560],
     | 99.9500th=[21632], 99.9900th=[58624]

and here's with io_uring-msg_ring.1, which is just a straightforward
forward port of the previous patches on the same base as v2:

init_flags=3000
Wait on startup
4097: my fd 4, other 3
4096: my fd 3, other 4
Latencies for: Receiver
    percentiles (nsec):
     |  1.0000th=[ 5920],  5.0000th=[ 5920], 10.0000th=[ 5984],
     | 20.0000th=[ 5984], 30.0000th=[ 6048], 40.0000th=[ 6048],
     | 50.0000th=[ 6112], 60.0000th=[ 6304], 70.0000th=[ 6368],
     | 80.0000th=[ 6560], 90.0000th=[ 6880], 95.0000th=[ 7072],
     | 99.0000th=[ 7456], 99.5000th=[ 7712], 99.9000th=[ 8640],
     | 99.9500th=[10432], 99.9900th=[26240]
Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[ 9536],  5.0000th=[ 9664], 10.0000th=[ 9664],
     | 20.0000th=[ 9920], 30.0000th=[ 9920], 40.0000th=[10048],
     | 50.0000th=[10176], 60.0000th=[10304], 70.0000th=[10432],
     | 80.0000th=[10688], 90.0000th=[10944], 95.0000th=[11328],
     | 99.0000th=[11840], 99.5000th=[12096], 99.9000th=[13888],
     | 99.9500th=[15424], 99.9900th=[34560]
Jens Axboe May 28, 2024, 11:04 p.m. UTC | #8
On 5/28/24 12:31 PM, Jens Axboe wrote:
> I suspect a bug in the previous patches, because this is what the
> forward port looks like. First, for reference, the current results:

Got it sorted, and pinned sender and receiver on CPUs to avoid the
variation. It looks like this with the task_work approach that I sent
out as v1:

Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[ 2160],  5.0000th=[ 2672], 10.0000th=[ 2768],
     | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600],
     | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632],
     | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696],
     | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192],
     | 99.9500th=[16320], 99.9900th=[18304]
Latencies for: Receiver
    percentiles (nsec):
     |  1.0000th=[ 1528],  5.0000th=[ 1576], 10.0000th=[ 1656],
     | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064],
     | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096],
     | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160],
     | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168],
     | 99.9500th=[15424], 99.9900th=[17280]

and here's the exact same test run on the current patches:

Latencies for: Sender
    percentiles (nsec):
     |  1.0000th=[  362],  5.0000th=[  362], 10.0000th=[  370],
     | 20.0000th=[  370], 30.0000th=[  370], 40.0000th=[  370],
     | 50.0000th=[  374], 60.0000th=[  382], 70.0000th=[  382],
     | 80.0000th=[  382], 90.0000th=[  382], 95.0000th=[  390],
     | 99.0000th=[  402], 99.5000th=[  430], 99.9000th=[  900],
     | 99.9500th=[  972], 99.9900th=[ 1432]
Latencies for: Receiver
    percentiles (nsec):
     |  1.0000th=[ 1528],  5.0000th=[ 1544], 10.0000th=[ 1560],
     | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592],
     | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608],
     | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688],
     | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272],
     | 99.9500th=[14784], 99.9900th=[73216]

I'll try and augment the test app to do proper rated submissions, so I
can ramp up the rates a bit and see what happens.
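
Something along these lines for the rated sends (an assumption, not the
actual change to the test app): pace submissions to absolute deadlines on
CLOCK_MONOTONIC rather than sleeping a fixed amount after each message.

#include <liburing.h>
#include <time.h>

static void send_rated(struct io_uring *ring, int target_fd,
		       unsigned long rate, unsigned long nr)
{
	long period = 1000000000L / rate;	/* ns between messages */
	struct timespec next;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (i = 0; i < nr; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		struct io_uring_cqe *cqe;

		io_uring_prep_msg_ring(sqe, target_fd, 0, i, 0);
		io_uring_submit_and_wait(ring, 1);
		io_uring_wait_cqe(ring, &cqe);
		io_uring_cqe_seen(ring, cqe);

		/* advance the absolute deadline and sleep to it */
		next.tv_nsec += period;
		if (next.tv_nsec >= 1000000000L) {
			next.tv_nsec -= 1000000000L;
			next.tv_sec++;
		}
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
	}
}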
Jens Axboe May 29, 2024, 1:35 a.m. UTC | #9
On 5/28/24 5:04 PM, Jens Axboe wrote:
> On 5/28/24 12:31 PM, Jens Axboe wrote:
>> I suspect a bug in the previous patches, because this is what the
>> forward port looks like. First, for reference, the current results:
> 
> Got it sorted, and pinned sender and receiver on CPUs to avoid the
> variation. It looks like this with the task_work approach that I sent
> out as v1:
> 
> Latencies for: Sender
>     percentiles (nsec):
>      |  1.0000th=[ 2160],  5.0000th=[ 2672], 10.0000th=[ 2768],
>      | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600],
>      | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632],
>      | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696],
>      | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192],
>      | 99.9500th=[16320], 99.9900th=[18304]
> Latencies for: Receiver
>     percentiles (nsec):
>      |  1.0000th=[ 1528],  5.0000th=[ 1576], 10.0000th=[ 1656],
>      | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064],
>      | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096],
>      | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160],
>      | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168],
>      | 99.9500th=[15424], 99.9900th=[17280]
> 
> and here's the exact same test run on the current patches:
> 
> Latencies for: Sender
>     percentiles (nsec):
>      |  1.0000th=[  362],  5.0000th=[  362], 10.0000th=[  370],
>      | 20.0000th=[  370], 30.0000th=[  370], 40.0000th=[  370],
>      | 50.0000th=[  374], 60.0000th=[  382], 70.0000th=[  382],
>      | 80.0000th=[  382], 90.0000th=[  382], 95.0000th=[  390],
>      | 99.0000th=[  402], 99.5000th=[  430], 99.9000th=[  900],
>      | 99.9500th=[  972], 99.9900th=[ 1432]
> Latencies for: Receiver
>     percentiles (nsec):
>      |  1.0000th=[ 1528],  5.0000th=[ 1544], 10.0000th=[ 1560],
>      | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592],
>      | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608],
>      | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688],
>      | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272],
>      | 99.9500th=[14784], 99.9900th=[73216]
> 
> I'll try and augment the test app to do proper rated submissions, so I
> can ramp up the rates a bit and see what happens.

And the final one, with the rated sends sorted out. One key observation
is that v1 trails the current edition, it just can't keep up as the rate
is increased. If we cap the rate at what should be 33K messages per
second, v1 gets ~28K messages and has the following latency profile (for
a 3 second run):

Latencies for: Receiver (msg=83863)
    percentiles (nsec):
     |  1.0000th=[  1208],  5.0000th=[  1336], 10.0000th=[  1400],
     | 20.0000th=[  1768], 30.0000th=[  1912], 40.0000th=[  1976],
     | 50.0000th=[  2040], 60.0000th=[  2160], 70.0000th=[  2256],
     | 80.0000th=[  2480], 90.0000th=[  2736], 95.0000th=[  3024],
     | 99.0000th=[  4080], 99.5000th=[  4896], 99.9000th=[  9664],
     | 99.9500th=[ 17024], 99.9900th=[218112]
Latencies for: Sender (msg=83863)
    percentiles (nsec):
     |  1.0000th=[  1928],  5.0000th=[  2064], 10.0000th=[  2160],
     | 20.0000th=[  2608], 30.0000th=[  2672], 40.0000th=[  2736],
     | 50.0000th=[  2864], 60.0000th=[  2960], 70.0000th=[  3152],
     | 80.0000th=[  3408], 90.0000th=[  4128], 95.0000th=[  4576],
     | 99.0000th=[  5920], 99.5000th=[  6752], 99.9000th=[ 13376],
     | 99.9500th=[ 22912], 99.9900th=[261120]

and the current edition does:

Latencies for: Sender (msg=94488)
    percentiles (nsec):
     |  1.0000th=[  181],  5.0000th=[  191], 10.0000th=[  201],
     | 20.0000th=[  215], 30.0000th=[  225], 40.0000th=[  235],
     | 50.0000th=[  262], 60.0000th=[  306], 70.0000th=[  430],
     | 80.0000th=[ 1004], 90.0000th=[ 2480], 95.0000th=[ 3632],
     | 99.0000th=[ 8096], 99.5000th=[12352], 99.9000th=[18048],
     | 99.9500th=[19584], 99.9900th=[23680]
Latencies for: Receiver (msg=94488)
    percentiles (nsec):
     |  1.0000th=[  342],  5.0000th=[  398], 10.0000th=[  482],
     | 20.0000th=[  652], 30.0000th=[  812], 40.0000th=[  972],
     | 50.0000th=[ 1240], 60.0000th=[ 1640], 70.0000th=[ 1944],
     | 80.0000th=[ 2448], 90.0000th=[ 3248], 95.0000th=[ 5216],
     | 99.0000th=[10304], 99.5000th=[12352], 99.9000th=[18048],
     | 99.9500th=[19840], 99.9900th=[23168]

If we cap it where v1 keeps up, at 13K messages per second, v1 does:

Latencies for: Receiver (msg=38820)
    percentiles (nsec):
     |  1.0000th=[ 1160],  5.0000th=[ 1256], 10.0000th=[ 1352],
     | 20.0000th=[ 1688], 30.0000th=[ 1928], 40.0000th=[ 1976],
     | 50.0000th=[ 2064], 60.0000th=[ 2384], 70.0000th=[ 2480],
     | 80.0000th=[ 2768], 90.0000th=[ 3280], 95.0000th=[ 3472],
     | 99.0000th=[ 4192], 99.5000th=[ 4512], 99.9000th=[ 6624],
     | 99.9500th=[ 8768], 99.9900th=[14272]
Latencies for: Sender (msg=38820)
    percentiles (nsec):
     |  1.0000th=[ 1848],  5.0000th=[ 1928], 10.0000th=[ 2040],
     | 20.0000th=[ 2608], 30.0000th=[ 2640], 40.0000th=[ 2736],
     | 50.0000th=[ 3024], 60.0000th=[ 3120], 70.0000th=[ 3376],
     | 80.0000th=[ 3824], 90.0000th=[ 4512], 95.0000th=[ 4768],
     | 99.0000th=[ 5536], 99.5000th=[ 6048], 99.9000th=[ 9024],
     | 99.9500th=[10304], 99.9900th=[23424]

and v2 does:

Latencies for: Sender (msg=39005)
    percentiles (nsec):
     |  1.0000th=[  191],  5.0000th=[  211], 10.0000th=[  262],
     | 20.0000th=[  342], 30.0000th=[  382], 40.0000th=[  402],
     | 50.0000th=[  450], 60.0000th=[  532], 70.0000th=[ 1080],
     | 80.0000th=[ 1848], 90.0000th=[ 4768], 95.0000th=[10944],
     | 99.0000th=[16512], 99.5000th=[18304], 99.9000th=[22400],
     | 99.9500th=[26496], 99.9900th=[41728]
Latencies for: Receiver (msg=39005)
    percentiles (nsec):
     |  1.0000th=[  410],  5.0000th=[  604], 10.0000th=[  700],
     | 20.0000th=[  900], 30.0000th=[ 1128], 40.0000th=[ 1320],
     | 50.0000th=[ 1672], 60.0000th=[ 2256], 70.0000th=[ 2736],
     | 80.0000th=[ 3760], 90.0000th=[ 5408], 95.0000th=[11072],
     | 99.0000th=[18304], 99.5000th=[20096], 99.9000th=[24704],
     | 99.9500th=[27520], 99.9900th=[35584]
Pavel Begunkov May 29, 2024, 2:08 a.m. UTC | #10
On 5/29/24 02:35, Jens Axboe wrote:
> On 5/28/24 5:04 PM, Jens Axboe wrote:
>> On 5/28/24 12:31 PM, Jens Axboe wrote:
>>> I suspect a bug in the previous patches, because this is what the
>>> forward port looks like. First, for reference, the current results:
>>
>> Got it sorted, and pinned sender and receiver on CPUs to avoid the
>> variation. It looks like this with the task_work approach that I sent
>> out as v1:
>>
>> Latencies for: Sender
>>      percentiles (nsec):
>>       |  1.0000th=[ 2160],  5.0000th=[ 2672], 10.0000th=[ 2768],
>>       | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600],
>>       | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632],
>>       | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696],
>>       | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192],
>>       | 99.9500th=[16320], 99.9900th=[18304]
>> Latencies for: Receiver
>>      percentiles (nsec):
>>       |  1.0000th=[ 1528],  5.0000th=[ 1576], 10.0000th=[ 1656],
>>       | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064],
>>       | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096],
>>       | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160],
>>       | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168],
>>       | 99.9500th=[15424], 99.9900th=[17280]
>>
>> and here's the exact same test run on the current patches:
>>
>> Latencies for: Sender
>>      percentiles (nsec):
>>       |  1.0000th=[  362],  5.0000th=[  362], 10.0000th=[  370],
>>       | 20.0000th=[  370], 30.0000th=[  370], 40.0000th=[  370],
>>       | 50.0000th=[  374], 60.0000th=[  382], 70.0000th=[  382],
>>       | 80.0000th=[  382], 90.0000th=[  382], 95.0000th=[  390],
>>       | 99.0000th=[  402], 99.5000th=[  430], 99.9000th=[  900],
>>       | 99.9500th=[  972], 99.9900th=[ 1432]
>> Latencies for: Receiver
>>      percentiles (nsec):
>>       |  1.0000th=[ 1528],  5.0000th=[ 1544], 10.0000th=[ 1560],
>>       | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592],
>>       | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608],
>>       | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688],
>>       | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272],
>>       | 99.9500th=[14784], 99.9900th=[73216]
>>
>> I'll try and augment the test app to do proper rated submissions, so I
>> can ramp up the rates a bit and see what happens.
> 
> And the final one, with the rated sends sorted out. One key observation
> is that v1 trails the current edition, it just can't keep up as the rate
> is increased. If we cap the rate at what should be 33K messages per
> second, v1 gets ~28K messages and has the following latency profile (for
> a 3 second run)

Do you see where the receiver latency comes from? The wakeups are
quite similar in nature, assuming it's all wait(nr=1) and CPUs
are not 100% consumed. The hop back spoils scheduling timing?


> Latencies for: Receiver (msg=83863)
>      percentiles (nsec):
>       |  1.0000th=[  1208],  5.0000th=[  1336], 10.0000th=[  1400],
>       | 20.0000th=[  1768], 30.0000th=[  1912], 40.0000th=[  1976],
>       | 50.0000th=[  2040], 60.0000th=[  2160], 70.0000th=[  2256],
>       | 80.0000th=[  2480], 90.0000th=[  2736], 95.0000th=[  3024],
>       | 99.0000th=[  4080], 99.5000th=[  4896], 99.9000th=[  9664],
>       | 99.9500th=[ 17024], 99.9900th=[218112]
> Latencies for: Sender (msg=83863)
>      percentiles (nsec):
>       |  1.0000th=[  1928],  5.0000th=[  2064], 10.0000th=[  2160],
>       | 20.0000th=[  2608], 30.0000th=[  2672], 40.0000th=[  2736],
>       | 50.0000th=[  2864], 60.0000th=[  2960], 70.0000th=[  3152],
>       | 80.0000th=[  3408], 90.0000th=[  4128], 95.0000th=[  4576],
>       | 99.0000th=[  5920], 99.5000th=[  6752], 99.9000th=[ 13376],
>       | 99.9500th=[ 22912], 99.9900th=[261120]
> 
> and the current edition does:
> 
> Latencies for: Sender (msg=94488)
>      percentiles (nsec):
>       |  1.0000th=[  181],  5.0000th=[  191], 10.0000th=[  201],
>       | 20.0000th=[  215], 30.0000th=[  225], 40.0000th=[  235],
>       | 50.0000th=[  262], 60.0000th=[  306], 70.0000th=[  430],
>       | 80.0000th=[ 1004], 90.0000th=[ 2480], 95.0000th=[ 3632],
>       | 99.0000th=[ 8096], 99.5000th=[12352], 99.9000th=[18048],
>       | 99.9500th=[19584], 99.9900th=[23680]
> Latencies for: Receiver (msg=94488)
>      percentiles (nsec):
>       |  1.0000th=[  342],  5.0000th=[  398], 10.0000th=[  482],
>       | 20.0000th=[  652], 30.0000th=[  812], 40.0000th=[  972],
>       | 50.0000th=[ 1240], 60.0000th=[ 1640], 70.0000th=[ 1944],
>       | 80.0000th=[ 2448], 90.0000th=[ 3248], 95.0000th=[ 5216],
>       | 99.0000th=[10304], 99.5000th=[12352], 99.9000th=[18048],
>       | 99.9500th=[19840], 99.9900th=[23168]
> 
> If we cap it where v1 keeps up, at 13K messages per second, v1 does:
> 
> Latencies for: Receiver (msg=38820)
>      percentiles (nsec):
>       |  1.0000th=[ 1160],  5.0000th=[ 1256], 10.0000th=[ 1352],
>       | 20.0000th=[ 1688], 30.0000th=[ 1928], 40.0000th=[ 1976],
>       | 50.0000th=[ 2064], 60.0000th=[ 2384], 70.0000th=[ 2480],
>       | 80.0000th=[ 2768], 90.0000th=[ 3280], 95.0000th=[ 3472],
>       | 99.0000th=[ 4192], 99.5000th=[ 4512], 99.9000th=[ 6624],
>       | 99.9500th=[ 8768], 99.9900th=[14272]
> Latencies for: Sender (msg=38820)
>      percentiles (nsec):
>       |  1.0000th=[ 1848],  5.0000th=[ 1928], 10.0000th=[ 2040],
>       | 20.0000th=[ 2608], 30.0000th=[ 2640], 40.0000th=[ 2736],
>       | 50.0000th=[ 3024], 60.0000th=[ 3120], 70.0000th=[ 3376],
>       | 80.0000th=[ 3824], 90.0000th=[ 4512], 95.0000th=[ 4768],
>       | 99.0000th=[ 5536], 99.5000th=[ 6048], 99.9000th=[ 9024],
>       | 99.9500th=[10304], 99.9900th=[23424]
> 
> and v2 does:
> 
> Latencies for: Sender (msg=39005)
>      percentiles (nsec):
>       |  1.0000th=[  191],  5.0000th=[  211], 10.0000th=[  262],
>       | 20.0000th=[  342], 30.0000th=[  382], 40.0000th=[  402],
>       | 50.0000th=[  450], 60.0000th=[  532], 70.0000th=[ 1080],
>       | 80.0000th=[ 1848], 90.0000th=[ 4768], 95.0000th=[10944],
>       | 99.0000th=[16512], 99.5000th=[18304], 99.9000th=[22400],
>       | 99.9500th=[26496], 99.9900th=[41728]
> Latencies for: Receiver (msg=39005)
>      percentiles (nsec):
>       |  1.0000th=[  410],  5.0000th=[  604], 10.0000th=[  700],
>       | 20.0000th=[  900], 30.0000th=[ 1128], 40.0000th=[ 1320],
>       | 50.0000th=[ 1672], 60.0000th=[ 2256], 70.0000th=[ 2736],
>       | 80.0000th=[ 3760], 90.0000th=[ 5408], 95.0000th=[11072],
>       | 99.0000th=[18304], 99.5000th=[20096], 99.9000th=[24704],
>       | 99.9500th=[27520], 99.9900th=[35584]
>
Jens Axboe May 29, 2024, 2:42 a.m. UTC | #11
On 5/28/24 8:08 PM, Pavel Begunkov wrote:
> On 5/29/24 02:35, Jens Axboe wrote:
>> On 5/28/24 5:04 PM, Jens Axboe wrote:
>>> On 5/28/24 12:31 PM, Jens Axboe wrote:
>>>> I suspect a bug in the previous patches, because this is what the
>>>> forward port looks like. First, for reference, the current results:
>>>
>>> Got it sorted, and pinned sender and receiver on CPUs to avoid the
>>> variation. It looks like this with the task_work approach that I sent
>>> out as v1:
>>>
>>> Latencies for: Sender
>>>      percentiles (nsec):
>>>       |  1.0000th=[ 2160],  5.0000th=[ 2672], 10.0000th=[ 2768],
>>>       | 20.0000th=[ 3568], 30.0000th=[ 3568], 40.0000th=[ 3600],
>>>       | 50.0000th=[ 3600], 60.0000th=[ 3600], 70.0000th=[ 3632],
>>>       | 80.0000th=[ 3632], 90.0000th=[ 3664], 95.0000th=[ 3696],
>>>       | 99.0000th=[ 4832], 99.5000th=[15168], 99.9000th=[16192],
>>>       | 99.9500th=[16320], 99.9900th=[18304]
>>> Latencies for: Receiver
>>>      percentiles (nsec):
>>>       |  1.0000th=[ 1528],  5.0000th=[ 1576], 10.0000th=[ 1656],
>>>       | 20.0000th=[ 2040], 30.0000th=[ 2064], 40.0000th=[ 2064],
>>>       | 50.0000th=[ 2064], 60.0000th=[ 2064], 70.0000th=[ 2096],
>>>       | 80.0000th=[ 2096], 90.0000th=[ 2128], 95.0000th=[ 2160],
>>>       | 99.0000th=[ 3472], 99.5000th=[14784], 99.9000th=[15168],
>>>       | 99.9500th=[15424], 99.9900th=[17280]
>>>
>>> and here's the exact same test run on the current patches:
>>>
>>> Latencies for: Sender
>>>      percentiles (nsec):
>>>       |  1.0000th=[  362],  5.0000th=[  362], 10.0000th=[  370],
>>>       | 20.0000th=[  370], 30.0000th=[  370], 40.0000th=[  370],
>>>       | 50.0000th=[  374], 60.0000th=[  382], 70.0000th=[  382],
>>>       | 80.0000th=[  382], 90.0000th=[  382], 95.0000th=[  390],
>>>       | 99.0000th=[  402], 99.5000th=[  430], 99.9000th=[  900],
>>>       | 99.9500th=[  972], 99.9900th=[ 1432]
>>> Latencies for: Receiver
>>>      percentiles (nsec):
>>>       |  1.0000th=[ 1528],  5.0000th=[ 1544], 10.0000th=[ 1560],
>>>       | 20.0000th=[ 1576], 30.0000th=[ 1592], 40.0000th=[ 1592],
>>>       | 50.0000th=[ 1592], 60.0000th=[ 1608], 70.0000th=[ 1608],
>>>       | 80.0000th=[ 1640], 90.0000th=[ 1672], 95.0000th=[ 1688],
>>>       | 99.0000th=[ 1848], 99.5000th=[ 2128], 99.9000th=[14272],
>>>       | 99.9500th=[14784], 99.9900th=[73216]
>>>
>>> I'll try and augment the test app to do proper rated submissions, so I
>>> can ramp up the rates a bit and see what happens.
>>
>> And the final one, with the rated sends sorted out. One key observation
>> is that v1 trails the current edition, it just can't keep up as the rate
>> is increased. If we cap the rate at what should be 33K messages per
>> second, v1 gets ~28K messages and has the following latency profile (for
>> a 3 second run)
> 
> Do you see where the receiver latency comes from? The wakeups are
> quite similar in nature, assuming it's all wait(nr=1) and CPUs
> are not 100% consumed. The hop back spoils scheduling timing?

I haven't dug into that side yet, but I'm guessing it's indeed
scheduling artifacts. It's all doing single waits, the sender is doing
io_uring_submit_and_wait(ring, 1) and the receiver is doing
io_uring_wait_cqe(ring, &cqe);