[PATCHSET,v2,0/7] Improve MSG_RING DEFER_TASKRUN performance

Message ID: 20240530152822.535791-2-axboe@kernel.dk
From: Jens Axboe <axboe@kernel.dk>
To: io-uring@vger.kernel.org
Subject: [PATCHSET v2 0/7] Improve MSG_RING DEFER_TASKRUN performance
Date: Thu, 30 May 2024 09:23:37 -0600
Series: Improve MSG_RING DEFER_TASKRUN performance
  [1/7] io_uring/msg_ring: split fd installing into a helper
  [2/7] io_uring/msg_ring: tighten requirement for remote posting
  [3/7] io_uring/msg_ring: avoid double indirection task_work for data messages
  [4/7] io_uring/msg_ring: avoid double indirection task_work for fd passing
  [5/7] io_uring/msg_ring: add an alloc cache for CQE entries
  [6/7] io_uring/msg_ring: remove callback_head from struct io_msg
  [7/7] io_uring/msg_ring: remove non-remote message passing

Jens Axboe May 30, 2024, 3:23 p.m. UTC

Hi,

For v1 and replies to that and tons of perf measurements, go here:

https://lore.kernel.org/io-uring/3d553205-0fe2-482e-8d4c-a4a1ad278893@kernel.dk/T/#m12f44c0a9ee40a59b0dcc226e22a0d031903aa73

as I won't duplicate them in here. Performance has been improved since
v1 as well, as the slab accounting is gone and we now rely solely on
the completion_lock on the issuer side.

Changes since v1:
- Change commit messages to reflect it's DEFER_TASKRUN, not SINGLE_ISSUER
- Get rid of the need to double lock on the target uring_lock
- Relax the check for needing remote posting, and then finally kill it
- Unify it across ring types
- Kill (now) unused callback_head in io_msg
- Add overflow caching to avoid __GFP_ACCOUNT overhead
- Rebase on current git master with 6.9 and 6.10 fixes pulled in
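
As an illustration of the overflow caching item in the list above, here is a
minimal userspace sketch of an allocation cache. It is not the code from the
series: `ocqe`, `ocqe_cache`, `ocqe_get()`, `ocqe_put()`,
`ocqe_cache_drain()` and `CACHE_MAX` are hypothetical names, and locking is
left out entirely. The idea is only that recycled entries are handed out
first, so the allocator (and any accounting it does) is hit on a cache miss
rather than on every message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_MAX 4	/* hypothetical cap on cached entries */

/* Stand-in for an overflow CQE entry; not the kernel's struct. */
struct ocqe {
	struct ocqe *next;
	unsigned long long user_data;
	int res;
};

/* Simple singly linked free list of recycled entries. */
struct ocqe_cache {
	struct ocqe *free_list;
	unsigned int nr_cached;
};

/* Prefer a recycled entry; fall back to the allocator on a miss. */
static struct ocqe *ocqe_get(struct ocqe_cache *c)
{
	struct ocqe *e = c->free_list;

	if (e) {
		c->free_list = e->next;
		c->nr_cached--;
		return e;
	}
	return malloc(sizeof(*e));
}

/* Recycle the entry if there is room, otherwise really free it. */
static void ocqe_put(struct ocqe_cache *c, struct ocqe *e)
{
	assert(e != NULL);
	if (c->nr_cached >= CACHE_MAX) {
		free(e);
		return;
	}
	e->next = c->free_list;
	c->free_list = e;
	c->nr_cached++;
}

/* Release everything still cached, e.g. at teardown. */
static void ocqe_cache_drain(struct ocqe_cache *c)
{
	while (c->free_list) {
		struct ocqe *e = c->free_list;

		c->free_list = e->next;
		free(e);
	}
	c->nr_cached = 0;
}
```

A caller would pair each `ocqe_get()` with an `ocqe_put()` once the entry has
been consumed, and call `ocqe_cache_drain()` when the cache is torn down. A
real implementation would have to serialize access to the cache, which this
sketch does not attempt.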

Pavel Begunkov June 3, 2024, 1:53 p.m. UTC | #1

On 5/30/24 16:23, Jens Axboe wrote:
> Hi,
> 
> For v1 and replies to that and tons of perf measurements, go here:

I'd really prefer the task_work version rather than carving
yet another path specific to msg_ring. Perf might sound better,
but it's duplicating wakeup paths, it's not integrated with batch
waiting, it's not clear how it affects different workloads with
target locking, and it would work weirdly in terms of ordering.

If the swing back is that expensive, another option is to
allocate a new request and let the target ring deallocate
it once the message is delivered (similar to that overflow
entry).

> https://lore.kernel.org/io-uring/3d553205-0fe2-482e-8d4c-a4a1ad278893@kernel.dk/T/#m12f44c0a9ee40a59b0dcc226e22a0d031903aa73
> 
> as I won't duplicate them in here. Performance has been improved since
> v1 as well, as the slab accounting is gone and we now rely solely on
> the completion_lock on the issuer side.
> 
> Changes since v1:
> - Change commit messages to reflect it's DEFER_TASKRUN, not SINGLE_ISSUER
> - Get rid of the need to double lock on the target uring_lock
> - Relax the check for needing remote posting, and then finally kill it
> - Unify it across ring types
> - Kill (now) unused callback_head in io_msg
> - Add overflow caching to avoid __GFP_ACCOUNT overhead
> - Rebase on current git master with 6.9 and 6.10 fixes pulled in
>

Jens Axboe June 4, 2024, 6:57 p.m. UTC | #2

On 6/3/24 7:53 AM, Pavel Begunkov wrote:
> On 5/30/24 16:23, Jens Axboe wrote:
>> Hi,
>>
>> For v1 and replies to that and tons of perf measurements, go here:
> 
> I'd really prefer the task_work version rather than carving
> yet another path specific to msg_ring. Perf might sound better,
> but it's duplicating wakeup paths, it's not integrated with batch
> waiting, it's not clear how it affects different workloads with
> target locking, and it would work weirdly in terms of ordering.

The duplication is really minor, basically non-existent imho. It's a
wakeup call, it's literally 2 lines of code. I do agree on the batching,
though I don't think that's really a big concern as most usage I'd
expect from this would be sending single messages. You're not batch
waiting on those. But there could obviously be cases where you have a
lot of mixed traffic, and for those it would make sense to have the
batch wakeups.

What I do like with this version is that we end up with just one method
for delivering the CQE, rather than needing to split it into two. And it
gets rid of the uring_lock double locking for non-SINGLE_ISSUER. I know
we always try and push people towards DEFER_TASKRUN|SINGLE_ISSUER, but
that doesn't mean we should just ignore the cases where that isn't true.
Unifying that code and making it faster all around is a worthy goal in
and of itself. The code is CERTAINLY a lot cleaner after the change than
all the IOPOLL etc.

> If the swing back is that expensive, another option is to
> allocate a new request and let the target ring deallocate
> it once the message is delivered (similar to that overflow
> entry).

I can give it a shot, and then run some testing. If we get close enough
with the latencies and performance, then I'd certainly be more amenable
to going either route.

We'd definitely need to pass in the required memory and avoid the return
round trip, as that basically doubles the cost (and latency) of sending
a message. The downside of what you suggest here is that while that
should integrate nicely with existing local task_work, it'll also mean
that we'll need hot path checks for treating that request type as a
special thing. Things like req->ctx being not local, freeing the request
rather than recycling, etc. And that'll need to happen in multiple
spots.

Jens Axboe June 4, 2024, 7:55 p.m. UTC | #3

On 6/4/24 12:57 PM, Jens Axboe wrote:
>> If the swing back is that expensive, another option is to
>> allocate a new request and let the target ring deallocate
>> it once the message is delivered (similar to that overflow
>> entry).
> 
> I can give it a shot, and then run some testing. If we get close enough
> with the latencies and performance, then I'd certainly be more amenable
> to going either route.
> 
> We'd definitely need to pass in the required memory and avoid the return
> round trip, as that basically doubles the cost (and latency) of sending
> a message. The downside of what you suggest here is that while that
> should integrate nicely with existing local task_work, it'll also mean
> that we'll need hot path checks for treating that request type as a
> special thing. Things like req->ctx being not local, freeing the request
> rather than recycling, etc. And that'll need to happen in multiple
> spots.

On top of that, you also need CQE memory for the other side, it's not
just the req itself. Otherwise you don't know if it'll post or not, in
case of low memory situations.

I dunno, I feel like this solution would get a lot more complicated than
it is now, rather than make it simpler.

Pavel Begunkov June 5, 2024, 3:50 p.m. UTC | #4

On 6/4/24 19:57, Jens Axboe wrote:
> On 6/3/24 7:53 AM, Pavel Begunkov wrote:
>> On 5/30/24 16:23, Jens Axboe wrote:
>>> Hi,
>>>
>>> For v1 and replies to that and tons of perf measurements, go here:
>>
>> I'd really prefer the task_work version rather than carving
>> yet another path specific to msg_ring. Perf might sound better,
>> but it's duplicating wakeup paths, it's not integrated with batch
>> waiting, it's not clear how it affects different workloads with
>> target locking, and it would work weirdly in terms of ordering.
> 
> The duplication is really minor, basically non-existent imho. It's a
> wakeup call, it's literally 2 lines of code. I do agree on the batching,

Well, v3 tries to add msg_ring/nr_overflow handling to local
task work, that's what I mean by duplicating paths, and we'll
continue gutting the hot path for supporting msg_ring in
this way.

Does it work with eventfd? I can't find any handling, so next
you'd be adding:

io_commit_cqring_flush(ctx);

Likely draining around cq_extra should also be patched.
Yes, fixable, but it'll be a pile of fun, and without many
users, it'll take time to discover it all.

> though I don't think that's really a big concern as most usage I'd
> expect from this would be sending single messages. You're not batch
> waiting on those. But there could obviously be cases where you have a
> lot of mixed traffic, and for those it would make sense to have the
> batch wakeups.
> 
> What I do like with this version is that we end up with just one method
> for delivering the CQE, rather than needing to split it into two. And it
> gets rid of the uring_lock double locking for non-SINGLE_ISSUER. I know

You can't get rid of target locking for fd passing, the file tables
are sync'ed by the lock. Otherwise it's only IOPOLL, because with
normal rings it can and IIRC does take the completion_lock for CQE
posting. I don't see a problem here, unless you care that much about
IOPOLL?

> we always try and push people towards DEFER_TASKRUN|SINGLE_ISSUER, but
> that doesn't mean we should just ignore the cases where that isn't true.
> Unifying that code and making it faster all around is a worthy goal in
> and of itself. The code is CERTAINLY a lot cleaner after the change than
> all the IOPOLL etc.
> 
>> If the swing back is that expensive, another option is to
>> allocate a new request and let the target ring deallocate
>> it once the message is delivered (similar to that overflow
>> entry).
> 
> I can give it a shot, and then run some testing. If we get close enough
> with the latencies and performance, then I'd certainly be more amenable
> to going either route.
> 
> We'd definitely need to pass in the required memory and avoid the return

Right, same as with CQEs

> round trip, as that basically doubles the cost (and latency) of sending

Sender's latency, which is IMHO not important at all

> a message. The downside of what you suggest here is that while that
> should integrate nicely with existing local task_work, it'll also mean
> that we'll need hot path checks for treating that request type as a
> special thing. Things like req->ctx being not local, freeing the request
> rather than recycling, etc. And that'll need to happen in multiple
> spots.

I'm not suggesting feeding that request into flush_completions()
and common completion infra; it can be killed right in the tw callback.
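
A rough userspace model of the one-way handoff discussed above may help make
the shape of the idea concrete. It assumes C11 atomics and uses made-up names
(`msg_node`, `target_ring`, `msg_send()`, `target_run_work()`); it is not
io_uring code and says nothing about how the kernel would actually do it. The
sender allocates the node up front and pushes it onto the target's work list,
and the target's work callback posts the completion and frees the node
itself, so nothing bounces back to the sender.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical message node: allocated by the sender, freed by the target. */
struct msg_node {
	struct msg_node *next;
	unsigned long long user_data;
	int res;
};

/* Hypothetical target ring: a lock-free work list plus a CQE counter. */
struct target_ring {
	_Atomic(struct msg_node *) work_list;
	unsigned int cqes_posted;
	unsigned long long last_user_data;
};

/* Sender side: allocate, fill in, push. No completion comes back. */
static int msg_send(struct target_ring *t, unsigned long long user_data,
		    int res)
{
	struct msg_node *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	n->user_data = user_data;
	n->res = res;
	n->next = atomic_load(&t->work_list);
	/* A failed CAS refreshes n->next with the current head; just retry. */
	while (!atomic_compare_exchange_weak(&t->work_list, &n->next, n))
		;
	return 0;
}

/* Target side: grab the whole list, restore FIFO order, post and free. */
static unsigned int target_run_work(struct target_ring *t)
{
	struct msg_node *n = atomic_exchange(&t->work_list, NULL);
	struct msg_node *fifo = NULL;
	unsigned int handled = 0;

	assert(t != NULL);
	while (n) {		/* pushes are LIFO, so reverse the list */
		struct msg_node *next = n->next;

		n->next = fifo;
		fifo = n;
		n = next;
	}
	while (fifo) {
		struct msg_node *next = fifo->next;

		t->cqes_posted++;	/* stand-in for posting a CQE */
		t->last_user_data = fifo->user_data;
		free(fifo);		/* freed in the callback, not recycled */
		fifo = next;
		handled++;
	}
	return handled;
}
```

In this model the sender is done as soon as `msg_send()` returns, and the
target drains its list whenever it next calls `target_run_work()`. The sketch
ignores the memory needed for the CQE on the target side and any wakeup of the
target, both of which a real design would have to address.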

Jens Axboe June 5, 2024, 4:41 p.m. UTC | #5

On 6/5/24 9:50 AM, Pavel Begunkov wrote:
> On 6/4/24 19:57, Jens Axboe wrote:
>> On 6/3/24 7:53 AM, Pavel Begunkov wrote:
>>> On 5/30/24 16:23, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> For v1 and replies to that and tons of perf measurements, go here:
>>>
>>> I'd really prefer the task_work version rather than carving
>>> yet another path specific to msg_ring. Perf might sound better,
>>> but it's duplicating wakeup paths, it's not integrated with batch
>>> waiting, it's not clear how it affects different workloads with
>>> target locking, and it would work weirdly in terms of ordering.
>>
>> The duplication is really minor, basically non-existent imho. It's a
>> wakeup call, it's literally 2 lines of code. I do agree on the batching,
> 
> Well, v3 tries to add msg_ring/nr_overflow handling to local
> task work, that's what I mean by duplicating paths, and we'll
> continue gutting the hot path for supporting msg_ring in
> this way.

No matter how you look at it, there will be changes to the hot path
regardless of whether we use local task_work like in the original, or do
the current approach.

> Does it work with eventfd? I can't find any handling, so next
> you'd be adding:
> 
> io_commit_cqring_flush(ctx);

That's merely because the flagging should be done in io_defer_wake(),
moving that code to the common helper as well.

> Likely draining around cq_extra should also be patched.
> Yes, fixable, but it'll be a pile of fun, and without many
> users, it'll take time to discover it all.

Yes that may need tweaking indeed. But this is a bit of a chicken and
egg problem - there are not many users of it, because it currently
sucks. We have to make it better, and there's already one user lined up
because of these changes.

We can't just let MSG_RING linger. It's an appealing interface for
message passing where you are using rings on both sides, but it's
currently pretty much useless exactly for the case that we care about
the most - DEFER_TASKRUN. So right now you are caught between a rock and
a hard place, where you want to use DEFER_TASKRUN because it's a lot
better for the things that people care about, but if you need message
passing, then it doesn't work very well.

>> though I don't think that's really a big concern as most usage I'd
>> expect from this would be sending single messages. You're not batch
>> waiting on those. But there could obviously be cases where you have a
>> lot of mixed traffic, and for those it would make sense to have the
>> batch wakeups.
>>
>> What I do like with this version is that we end up with just one method
>> for delivering the CQE, rather than needing to split it into two. And it
>> gets rid of the uring_lock double locking for non-SINGLE_ISSUER. I know
> 
> You can't get rid of target locking for fd passing, the file tables
> are sync'ed by the lock. Otherwise it's only IOPOLL, because with
> normal rings it can and IIRC does take the completion_lock for CQE
> posting. I don't see a problem here, unless you care that much about
> IOPOLL?

Right, fd passing still needs to grab the lock, and it still does with
the patchset. We can't really get around it for fd passing, at least not
without further work (which I have no current plans to do). I don't
care about IOPOLL in particular for message passing, I don't think there
are any good use cases there. It's more of a code hygiene thing, the
branches are still there and do exist.

>> we always try and push people towards DEFER_TASKRUN|SINGLE_ISSUER, but
>> that doesn't mean we should just ignore the cases where that isn't true.
>> Unifying that code and making it faster all around is a worthy goal in
>> and of itself. The code is CERTAINLY a lot cleaner after the change than
>> all the IOPOLL etc.
>>
>>> If the swing back is that expensive, another option is to
>>> allocate a new request and let the target ring deallocate
>>> it once the message is delivered (similar to that overflow
>>> entry).
>>
>> I can give it a shot, and then run some testing. If we get close enough
>> with the latencies and performance, then I'd certainly be more amenable
>> to going either route.
>>
>> We'd definitely need to pass in the required memory and avoid the return
> 
> Right, same as with CQEs
> 
>> round trip, as that basically doubles the cost (and latency) of sending
> 
> Sender's latency, which is IMHO not important at all

But it IS important. Not because of the latency itself, that part is
less important, but because of the added overhead of bouncing from ring1
to ring2, and then back from ring2 to ring1. The reduction in latency is
a direct reflection of the reduction of overhead.

>> a message. The downside of what you suggest here is that while that
>> should integrate nicely with existing local task_work, it'll also mean
>> that we'll need hot path checks for treating that request type as a
>> special thing. Things like req->ctx being not local, freeing the request
>> rather than recycling, etc. And that'll need to happen in multiple
>> spots.
> 
> I'm not suggesting feeding that request into flush_completions()
> and common completion infra; it can be killed right in the tw callback.

Right, so you need to special case these requests when you run the local
task_work. Which was my point above, you're going to need to accept hot
path additions regardless of the approach.

Pavel Begunkov June 5, 2024, 7:20 p.m. UTC | #6

On 6/5/24 17:41, Jens Axboe wrote:
> On 6/5/24 9:50 AM, Pavel Begunkov wrote:
>> On 6/4/24 19:57, Jens Axboe wrote:
>>> On 6/3/24 7:53 AM, Pavel Begunkov wrote:
>>>> On 5/30/24 16:23, Jens Axboe wrote:
>>>>> Hi,
>>>>>
>>>>> For v1 and replies to that and tons of perf measurements, go here:
>>>>
>>>> I'd really prefer the task_work version rather than carving
>>>> yet another path specific to msg_ring. Perf might sound better,
>>>> but it's duplicating wakeup paths, it's not integrated with batch
>>>> waiting, it's not clear how it affects different workloads with
>>>> target locking, and it would work weirdly in terms of ordering.
>>>
>>> The duplication is really minor, basically non-existent imho. It's a
>>> wakeup call, it's literally 2 lines of code. I do agree on the batching,
>>
>> Well, v3 tries to add msg_ring/nr_overflow handling to local
>> task work, that's what I mean by duplicating paths, and we'll
>> continue gutting the hot path for supporting msg_ring in
>> this way.
> 
> No matter how you look at it, there will be changes to the hot path
> regardless of whether we use local task_work like in the original, or do
> the current approach.

The only downside for !msg_ring paths in the original was
un-inlining of local tw_add().

>> Does it work with eventfd? I can't find any handling, so next
>> you'd be adding:
>>
>> io_commit_cqring_flush(ctx);
> 
> That's merely because the flagging should be done in io_defer_wake(),
> moving that code to the common helper as well.

Flagging? If you mean io_commit_cqring_flush() then no,
it shouldn't and cannot be there. It's called strictly after
posting a CQE (or queuing an overflow), which is after tw
callback execution.

>> Likely draining around cq_extra should also be patched.
>> Yes, fixable, but it'll be a pile of fun, and without many
>> users, it'll take time to discover it all.
> 
> Yes that may need tweaking indeed. But this is a bit of a chicken and
> egg problem - there are not many users of it, because it currently
> sucks. We have to make it better, and there's already one user lined up
> because of these changes.
> 
> We can't just let MSG_RING linger. It's an appealing interface for
> message passing where you are using rings on both sides, but it's
> currently pretty much useless exactly for the case that we care about
> the most - DEFER_TASKRUN. So right now you are caught between a rock and
> a hard place, where you want to use DEFER_TASKRUN because it's a lot
> better for the things that people care about, but if you need message
> passing, then it doesn't work very well.
> 
>>> though I don't think that's really a big concern as most usage I'd
>>> expect from this would be sending single messages. You're not batch
>>> waiting on those. But there could obviously be cases where you have a
>>> lot of mixed traffic, and for those it would make sense to have the
>>> batch wakeups.
>>>
>>> What I do like with this version is that we end up with just one method
>>> for delivering the CQE, rather than needing to split it into two. And it
>>> gets rid of the uring_lock double locking for non-SINGLE_ISSUER. I know
>>
>> You can't get rid of target locking for fd passing, the file tables
>> are sync'ed by the lock. Otherwise it's only IOPOLL, because with
>> normal rings it can and IIRC does take the completion_lock for CQE
>> posting. I don't see a problem here, unless you care that much about
>> IOPOLL?
> 
> Right, fd passing still needs to grab the lock, and it still does with
> the patchset. We can't really get around it for fd passing, at least not
> without further work (which I have no current plans to do). I don't
> care about IOPOLL in particular for message passing, I don't think there
> are any good use cases there. It's more of a code hygiene thing, the
> branches are still there and do exist.
> 
>>> we always try and push people towards DEFER_TASKRUN|SINGLE_ISSUER, but
>>> that doesn't mean we should just ignore the cases where that isn't true.
>>> Unifying that code and making it faster all around is a worthy goal in
>>> and of itself. The code is CERTAINLY a lot cleaner after the change than
>>> all the IOPOLL etc.
>>>
>>>> If the swing back is that expensive, another option is to
>>>> allocate a new request and let the target ring deallocate
>>>> it once the message is delivered (similar to that overflow
>>>> entry).
>>>
>>> I can give it a shot, and then run some testing. If we get close enough
>>> with the latencies and performance, then I'd certainly be more amenable
>>> to going either route.
>>>
>>> We'd definitely need to pass in the required memory and avoid the return
>>
>> Right, same as with CQEs
>>
>>> round trip, as that basically doubles the cost (and latency) of sending
>>
>> Sender's latency, which is IMHO not important at all
> 
> But it IS important. Not because of the latency itself, that part is
> less important, but because of the added overhead of bouncing from ring1
> to ring2, and then back from ring2 to ring1. The reduction in latency is
> a direct reflection of the reduction of overhead.
> 
>>> a message. The downside of what you suggest here is that while that
>>> should integrate nicely with existing local task_work, it'll also mean
>>> that we'll need hot path checks for treating that request type as a
>>> special thing. Things like req->ctx being not local, freeing the request
>>> rather than recycling, etc. And that'll need to happen in multiple
>>> spots.
>>
>> I'm not suggesting feeding that request into flush_completions()
>> and common completion infra; it can be killed right in the tw callback.
> 
> Right, so you need to special case these requests when you run the local
> task_work. Which was my point above, you're going to need to accept hot
> path additions regardless of the approach.
>

Jens Axboe June 5, 2024, 7:36 p.m. UTC | #7

On 6/5/24 1:20 PM, Pavel Begunkov wrote:
> On 6/5/24 17:41, Jens Axboe wrote:
>> On 6/5/24 9:50 AM, Pavel Begunkov wrote:
>>> On 6/4/24 19:57, Jens Axboe wrote:
>>>> On 6/3/24 7:53 AM, Pavel Begunkov wrote:
>>>>> On 5/30/24 16:23, Jens Axboe wrote:
>>>>>> Hi,
>>>>>>
>>>>>> For v1 and replies to that and tons of perf measurements, go here:
>>>>>
>>>>> I'd really prefer the task_work version rather than carving
>>>>> yet another path specific to msg_ring. Perf might sound better,
>>>>> but it's duplicating wakeup paths, it's not integrated with batch
>>>>> waiting, it's not clear how it affects different workloads with
>>>>> target locking, and it would work weirdly in terms of ordering.
>>>>
>>>> The duplication is really minor, basically non-existent imho. It's a
>>>> wakeup call, it's literally 2 lines of code. I do agree on the batching,
>>>
>>> Well, v3 tries to add msg_ring/nr_overflow handling to local
>>> task work, that's what I mean by duplicating paths, and we'll
>>> continue gutting the hot path for supporting msg_ring in
>>> this way.
>>
>> No matter how you look at it, there will be changes to the hot path
>> regardless of whether we use local task_work like in the original, or do
>> the current approach.
> 
> The only downside for !msg_ring paths in the original was
> un-inlining of local tw_add().

You're comparing an incomplete RFC to a more complete patchset, that
will not be the only downside once you're done with the local task_work
approach when the roundtrip is avoided. And that is my comparison base,
not some half finished POC that I posted for comments.

>>> Does it work with eventfd? I can't find any handling, so next
>>> you'd be adding:
>>>
>>> io_commit_cqring_flush(ctx);
>>
>> That's merely because the flagging should be done in io_defer_wake(),
>> moving that code to the common helper as well.
> 
> Flagging? If you mean io_commit_cqring_flush() then no,
> it shouldn't and cannot be there. It's called strictly after
> posting a CQE (or queuing an overflow), which is after tw
> callback execution.

I meant the SQ ring flagging and eventfd signaling, which is currently
done in local work adding. That should go in io_defer_wake().
