
[PATCHSET,0/4] Use io_wq_work_list for task_work

Message ID 20240326184615.458820-1-axboe@kernel.dk

Message

Jens Axboe March 26, 2024, 6:42 p.m. UTC
Hi,

This converts the deferred, normal, and fallback task_work to use a
normal io_wq_work_list, rather than an llist.

The main motivation behind this is to get rid of the need to reverse
the list once it's deleted and run. I tested this basic conversion of
just switching it from an llist to an io_wq_work_list with a spinlock,
and I don't see any benefits from the lockless list. And for cases where
we get a bursty addition of task_work, this approach is faster as it
avoids the need to iterate the list upfront while reversing it.

And this is less code and simpler, so I'd prefer to go that route.
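
For illustration, a minimal sketch of the consumer-side cost being
removed; the sketch_work type and helpers are made up and stand in for
the real io_uring structures. llist producers push at the head, so
llist_del_all() hands the consumer the batch newest-first and it has to
be reversed with llist_reverse_order() before running in queueing order:

#include <linux/llist.h>

struct sketch_work {
	struct llist_node node;
	void (*run)(struct sketch_work *work);
};

/* Producer side: lockless push to the head of the list. */
static void sketch_add(struct llist_head *list, struct sketch_work *work)
{
	llist_add(&work->node, list);
}

/*
 * Consumer side: llist_del_all() returns the nodes newest-first, so an
 * extra O(n) reversal walk is needed to run them in queueing order.
 */
static void sketch_run(struct llist_head *list)
{
	struct llist_node *first = llist_del_all(list);
	struct sketch_work *work, *tmp;

	first = llist_reverse_order(first);
	llist_for_each_entry_safe(work, tmp, first, node)
		work->run(work);
}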

 include/linux/io_uring_types.h |  13 +--
 io_uring/io_uring.c            | 175 ++++++++++++++++-----------------
 io_uring/io_uring.h            |   8 +-
 io_uring/sqpoll.c              |   8 +-
 io_uring/tctx.c                |   3 +-
 5 files changed, 102 insertions(+), 105 deletions(-)

Comments

Pavel Begunkov March 27, 2024, 1:33 p.m. UTC | #1
On 3/26/24 18:42, Jens Axboe wrote:
> Hi,
> 
> This converts the deferred, normal, and fallback task_work to use a
> normal io_wq_work_list, rather than an llist.
> 
> The main motivation behind this is to get rid of the need to reverse
> the list once it's deleted and run. I tested this basic conversion of
> just switching it from an llist to an io_wq_work_list with a spinlock,
> and I don't see any benefits from the lockless list. And for cases where
> we get a bursty addition of task_work, this approach is faster as it
> avoids the need to iterate the list upfront while reversing it.

I'm curious how you benchmarked it including accounting of irq/softirq
where tw add usually happens?

One known problem with the current list approach, which I mentioned several
times before, is that it peeks at the previously queued tw to count them.
It's not nice, but that could easily be done with cmpxchg double instead. I
wonder how much of an issue that is.
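
For context, roughly the pattern being referred to, as a hedged sketch
rather than the actual io_uring code; sketch_req and nr_tw are made-up
names. Each queued entry caches how many entries were ahead of it, so
the producer has to dereference the current head (the "peek") before
its cmpxchg:

#include <linux/atomic.h>
#include <linux/container_of.h>
#include <linux/llist.h>

struct sketch_req {
	struct llist_node node;
	unsigned int nr_tw;	/* entries in the list, including this one */
};

static unsigned int sketch_local_add(struct llist_head *list,
				     struct sketch_req *req)
{
	struct llist_node *head = READ_ONCE(list->first);
	unsigned int nr_tw;

	do {
		nr_tw = 1;
		if (head) {
			struct sketch_req *prev;

			/* the peek: an extra fetch of the previous request */
			prev = container_of(head, struct sketch_req, node);
			nr_tw = READ_ONCE(prev->nr_tw) + 1;
		}
		req->nr_tw = nr_tw;
		req->node.next = head;
	} while (!try_cmpxchg(&list->first, &head, &req->node));

	return nr_tw;
}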

> And this is less code and simpler, so I'd prefer to go that route.

I'm not sure it's less code. If you bring back the optimisations that I
believe were killed (see comments to patch 2), it might turn out to
be even bulkier and not that much simpler.


>   include/linux/io_uring_types.h |  13 +--
>   io_uring/io_uring.c            | 175 ++++++++++++++++-----------------
>   io_uring/io_uring.h            |   8 +-
>   io_uring/sqpoll.c              |   8 +-
>   io_uring/tctx.c                |   3 +-
>   5 files changed, 102 insertions(+), 105 deletions(-)
>
Jens Axboe March 27, 2024, 4:36 p.m. UTC | #2
On 3/27/24 7:33 AM, Pavel Begunkov wrote:
> On 3/26/24 18:42, Jens Axboe wrote:
>> Hi,
>>
>> This converts the deferred, normal, and fallback task_work to use a
>> normal io_wq_work_list, rather than an llist.
>>
>> The main motivation behind this is to get rid of the need to reverse
>> the list once it's deleted and run. I tested this basic conversion of
>> just switching it from an llist to an io_wq_work_list with a spinlock,
>> and I don't see any benefits from the lockless list. And for cases where
>> we get a bursty addition of task_work, this approach is faster as it
>> avoids the need to iterate the list upfront while reversing it.
> 
> I'm curious how you benchmarked it including accounting of irq/softirq
> where tw add usually happens?

Performance testing and profiles. I tested send zc with small packets, as
that is task_work intensive and exhibits the bursty behavior I mentioned
in the patch / cover letter. And normal storage IO, IRQ driven.

For send zc, we're spending about 2% of the time doing list reversal,
and I've seen it as high as 5% in other testing. And as that test is CPU
bound, performance is up about 2% as well.

With the patches, task work adding accounts for about 0.25% of the
cycles; before, it was about 0.66%.

We're spending a bit more time in __io_run_local_work(), but I think
that's deceptive as we have to disable/enable interrupts now. If an
interrupt triggers on the unlock, that time tends to be attributed there
in terms of cycles.
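
As a rough sketch of that shape (illustrative names, and a plain
list_head standing in for the actual io_wq_work_list): the splice runs
with interrupts disabled, so an interrupt that was held off on this CPU
is delivered right at the unlock and its cycles land in this function's
profile.

#include <linux/list.h>
#include <linux/spinlock.h>

struct sketch_work {
	struct list_head list;
	void (*run)(struct sketch_work *work);
};

static void sketch_run_locked(struct list_head *work_list, spinlock_t *lock)
{
	struct sketch_work *work, *tmp;
	LIST_HEAD(local);

	/* Interrupts are off across the splice, so task_work added from
	 * irq context on this CPU is held back until the unlock. */
	spin_lock_irq(lock);
	list_splice_init(work_list, &local);
	spin_unlock_irq(lock);	/* pending irqs fire here and get
				 * attributed to this function */

	/* Additions go to the tail, so the spliced list is already in
	 * FIFO order and needs no reversal pass. */
	list_for_each_entry_safe(work, tmp, &local, list)
		work->run(work);
}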

> One known problem with the current list approach, which I mentioned several
> times before, is that it peeks at the previously queued tw to count them.
> It's not nice, but that could easily be done with cmpxchg double instead. I
> wonder how much of an issue that is.

That's more of a wart than a real issue though, but this approach
obviously doesn't do that. And then we can drop the rcu section around
adding local task_work. Not a huge deal, but still nice.

>> And this is less code and simpler, so I'd prefer to go that route.
> 
> I'm not sure it's less code. If you bring back the optimisations that I
> believe were killed (see comments to patch 2), it might turn out to
> be even bulkier and not that much simpler.

It's still considerably less:

 3 files changed, 59 insertions(+), 84 deletions(-)

though that isn't conclusive by itself, as e.g. io_llist_xchg(), which has
some comments, goes away. But I do think the resulting code is simpler
and more straightforward.
Jens Axboe March 27, 2024, 5:05 p.m. UTC | #3
On 3/27/24 10:36 AM, Jens Axboe wrote:
> On 3/27/24 7:33 AM, Pavel Begunkov wrote:
>> On 3/26/24 18:42, Jens Axboe wrote:
>>> Hi,
>>>
>>> This converts the deferred, normal, and fallback task_work to use a
>>> normal io_wq_work_list, rather than an llist.
>>>
>>> The main motivation behind this is to get rid of the need to reverse
>>> the list once it's deleted and run. I tested this basic conversion of
>>> just switching it from an llist to an io_wq_work_list with a spinlock,
>>> and I don't see any benefits from the lockless list. And for cases where
>>> we get a bursty addition of task_work, this approach is faster as it
>>> avoids the need to iterate the list upfront while reversing it.
>>
>> I'm curious how you benchmarked it including accounting of irq/softirq
>> where tw add usually happens?
> 
> Performance testing and profiles. I tested send zc with small packets, as
> that is task_work intensive and exhibits the bursty behavior I mentioned
> in the patch / cover letter. And normal storage IO, IRQ driven.
> 
> For send zc, we're spending about 2% of the time doing list reversal,
> and I've seen it as high as 5% in other testing. And as that test is CPU
> bound, performance is up about 2% as well.
> 
> With the patches, task work adding accounts for about 0.25% of the
> cycles; before, it was about 0.66%.
> 
> We're spending a bit more time in __io_run_local_work(), but I think
> that's deceptive as we have to disable/enable interrupts now. If an
> interrupt triggers on the unlock, that time tends to be attributed there
> in terms of cycles.

Forgot to mention the storage side - profiles look eerily similar in
terms of time spent in task work adding / running; the only real
difference is that the 1.9% in llist_reverse_order() is gone.
Pavel Begunkov March 27, 2024, 6:04 p.m. UTC | #4
On 3/27/24 16:36, Jens Axboe wrote:
> On 3/27/24 7:33 AM, Pavel Begunkov wrote:
>> On 3/26/24 18:42, Jens Axboe wrote:
>>> Hi,
>>>
>>> This converts the deferred, normal, and fallback task_work to use a
>>> normal io_wq_work_list, rather than an llist.
>>>
>>> The main motivation behind this is to get rid of the need to reverse
>>> the list once it's deleted and run. I tested this basic conversion of
>>> just switching it from an llist to an io_wq_work_list with a spinlock,
>>> and I don't see any benefits from the lockless list. And for cases where
>>> we get a bursty addition of task_work, this approach is faster as it
>>> avoids the need to iterate the list upfront while reversing it.
>>
>> I'm curious how you benchmarked it including accounting of irq/softirq
>> where tw add usually happens?
> 
> Performance testing and profiles. I tested send zc with small packets, as
> that is task_work intensive and exhibits the bursty behavior I mentioned
> in the patch / cover letter. And normal storage IO, IRQ driven.

I assume IRQs are firing on random CPUs then, unless you configured it
otherwise. In that case it should be bouncing the cacheline, and that
peek at the prev also needs to fetch it from RAM / further caches,
unless there is enough time for TCP to batch them.

> For send zc, we're spending about 2% of the time doing list reversal,
> and I've seen it as high as 5% in other testing. And as that test is CPU

I've seen something similar before, but for me it was overhead shifted
from __io_run_local_work() fetching requests into the reversal touching
all of them. There should be a change in __io_run_local_work() total
cycles (incl. children) then, I assume.

> bound, performance is up about 2% as well.

Did you count, by any chance, how many items there were in the
list? An average or so.

> With the patches, task work adding accounts for about 0.25% of the
> cycles; before, it was about 0.66%.

I.e. the spinlock is faster. How come? It's the same cmpxchg in the
spinlock, often with cache misses, but with irq on/off on top. The only
difference I can remember is that peek into the prev req.

> We're spending a bit more time in __io_run_local_work(), but I think
> that's deceptive as we have to disable/enable interrupts now. If an
> interrupt triggers on the unlock, that time tends to be attributed there
> in terms of cycles.

Hmm, I think if the run_local_work runtime doesn't change you'd
statistically get the same number of interrupt "items" hitting it, but
they would be condensed more into the irq-off section. Or are you
accounting for some irq delivery / hw differences?

>> One known problem with the current list approach, which I mentioned several
>> times before, is that it peeks at the previously queued tw to count them.
>> It's not nice, but that could easily be done with cmpxchg double instead. I
>> wonder how much of an issue that is.
> 
> That's more of a wart than a real issue though, but this approach

Assuming tw add executes on random CPUs, that would be an additional
fetch on every tw add; I wouldn't disregard it right away.

> obviously doesn't do that. And then we can drop the rcu section around
> adding local task_work. Not a huge deal, but still nice.

I don't see how. After you queue the req it might be executed
immediately, dropping the ctx ref that was previously protecting ctx.
The rcu section is what protects ctx after that.
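
A sketch of that ordering, with entirely made-up ctx/req types and
helpers, assuming the ctx memory is only freed after an RCU grace
period:

#include <linux/rcupdate.h>

struct sketch_ctx;
struct sketch_req;

/* Hypothetical helpers, for illustration only. */
void sketch_list_add(struct sketch_ctx *ctx, struct sketch_req *req);
void sketch_wake_target(struct sketch_ctx *ctx);

static void sketch_local_tw_add(struct sketch_ctx *ctx, struct sketch_req *req)
{
	rcu_read_lock();
	/* Once the req is visible, the target task may run it and drop
	 * the last ctx reference at any moment... */
	sketch_list_add(ctx, req);
	/* ...so this wakeup touches ctx after the producer may no longer
	 * hold a reference; the RCU read section is what keeps the
	 * memory valid here. */
	sketch_wake_target(ctx);
	rcu_read_unlock();
}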

>>> And this is less code and simpler, so I'd prefer to go that route.
>>
>> I'm not sure it's less code. If you bring back the optimisations that I
>> believe were killed (see comments to patch 2), it might turn out to
>> be even bulkier and not that much simpler.
> 
> It's still considerably less:
> 
>   3 files changed, 59 insertions(+), 84 deletions(-)
> 
> though that isn't conclusive by itself, as e.g. io_llist_xchg(), which has
> some comments, goes away. But I do think the resulting code is simpler
> and more straightforward.