Message ID | 20240326184615.458820-1-axboe@kernel.dk (mailing list archive)
---|---
Series | Use io_wq_work_list for task_work
On 3/26/24 18:42, Jens Axboe wrote:
> Hi,
>
> This converts the deferred, normal, and fallback task_work to use a
> normal io_wq_work_list, rather than an llist.
>
> The main motivation behind this is to get rid of the need to reverse
> the list once it's deleted and run. I tested this basic conversion of
> just switching it from an llist to an io_wq_work_list with a spinlock,
> and I don't see any benefits from the lockless list. And for cases where
> we get a bursty addition of task_work, this approach is faster as it
> avoids the need to iterate the list upfront while reversing it.

I'm curious how you benchmarked it including accounting of irq/softirq
where tw add usually happens?

One known problem with the current list approach I mentioned several
times before is that it peeks at the previous queued tw to count them.
It's not nice, but that can be easily done with cmpxchg double. I
wonder how much of an issue that is.

> And this is less code and simpler, so I'd prefer to go that route.

I'm not sure it's less code; if you bring back the optimisations that I
believe were killed (see comments to patch 2), it might turn out to
be even bulkier and not that much simpler.

>  include/linux/io_uring_types.h |  13 +--
>  io_uring/io_uring.c            | 175 ++++++++++++++++-----------------
>  io_uring/io_uring.h            |   8 +-
>  io_uring/sqpoll.c              |   8 +-
>  io_uring/tctx.c                |   3 +-
>  5 files changed, 102 insertions(+), 105 deletions(-)
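For context, the difference the cover letter describes can be sketched in a few lines of userspace C. This is an editorial illustration, not the kernel code or this patchset: the `tw_node`, `lifo_*`, and `fifo_*` names are made up, and a pthread mutex stands in for the kernel spinlock. The first scheme mimics the llist-style lockless add, whose consumer has to reverse the batch to recover submission order; the second mimics a FIFO list under a lock, where no reversal is needed.

```c
/*
 * Editorial sketch, not kernel code: a minimal userspace model of the two
 * queueing schemes under discussion. All names are illustrative.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct tw_node {
	struct tw_node *next;
};

/* Scheme 1: lockless LIFO push, in the spirit of llist_add(). */
static _Atomic(struct tw_node *) lifo_head;

static void lifo_add(struct tw_node *node)
{
	struct tw_node *old = atomic_load(&lifo_head);

	do {
		node->next = old;
	} while (!atomic_compare_exchange_weak(&lifo_head, &old, node));
}

/* Consumer grabs the whole batch and must reverse it to get FIFO order. */
static struct tw_node *lifo_take_in_order(void)
{
	struct tw_node *node = atomic_exchange(&lifo_head, NULL);
	struct tw_node *prev = NULL;

	while (node) {
		struct tw_node *next = node->next;

		node->next = prev;
		prev = node;
		node = next;
	}
	return prev;
}

/* Scheme 2: FIFO list under a lock; items come out in add order directly. */
static struct tw_node *fifo_first, *fifo_last;
static pthread_mutex_t fifo_lock = PTHREAD_MUTEX_INITIALIZER;

static void fifo_add(struct tw_node *node)
{
	node->next = NULL;
	pthread_mutex_lock(&fifo_lock);
	if (fifo_last)
		fifo_last->next = node;
	else
		fifo_first = node;
	fifo_last = node;
	pthread_mutex_unlock(&fifo_lock);
}

static struct tw_node *fifo_take_in_order(void)
{
	struct tw_node *node;

	pthread_mutex_lock(&fifo_lock);
	node = fifo_first;
	fifo_first = fifo_last = NULL;
	pthread_mutex_unlock(&fifo_lock);
	return node;
}
```

In the kernel, the lock in the second scheme has to be taken with interrupts disabled, since task_work is also added from irq/softirq context; that is where the irq on/off cost discussed later in the thread comes from.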
On 3/27/24 7:33 AM, Pavel Begunkov wrote:
> On 3/26/24 18:42, Jens Axboe wrote:
>> Hi,
>>
>> This converts the deferred, normal, and fallback task_work to use a
>> normal io_wq_work_list, rather than an llist.
>>
>> The main motivation behind this is to get rid of the need to reverse
>> the list once it's deleted and run. I tested this basic conversion of
>> just switching it from an llist to an io_wq_work_list with a spinlock,
>> and I don't see any benefits from the lockless list. And for cases where
>> we get a bursty addition of task_work, this approach is faster as it
>> avoids the need to iterate the list upfront while reversing it.
>
> I'm curious how you benchmarked it including accounting of irq/softirq
> where tw add usually happens?

Performance runs and profiles. I tested send zc with small packets, as
that is task_work intensive and exhibits the bursty behavior I mentioned
in the patch / cover letter. And normal storage IO, IRQ driven.

For send zc, we're spending about 2% of the time doing list reversal,
and I've seen it as high as 5% in other testing. And as that test is CPU
bound, performance is up about 2% as well.

With the patches, task work adding accounts for about 0.25% of the
cycles, before it's about 0.66%.

We're spending a bit more time in __io_run_local_work(), but I think
that's deceptive as we have to disable/enable interrupts now. If an
interrupt triggers on the unlock, that time tends to be attributed there
in terms of cycles.

> One known problem with the current list approach I mentioned several
> times before is that it peeks at the previous queued tw to count them.
> It's not nice, but that can be easily done with cmpxchg double. I
> wonder how much of an issue that is.

That's more of a wart than a real issue though, but this approach
obviously doesn't do that. And then we can drop the rcu section around
adding local task_work. Not a huge deal, but still nice.

>> And this is less code and simpler, so I'd prefer to go that route.
>
> I'm not sure it's less code; if you bring back the optimisations that I
> believe were killed (see comments to patch 2), it might turn out to
> be even bulkier and not that much simpler.

It's still considerably less:

 3 files changed, 59 insertions(+), 84 deletions(-)

though that isn't conclusive by itself, as e.g. io_llist_xchg() goes away,
which has some comments. But I do think the resulting code is simpler
and more straightforward.
On 3/27/24 10:36 AM, Jens Axboe wrote:
> On 3/27/24 7:33 AM, Pavel Begunkov wrote:
>> On 3/26/24 18:42, Jens Axboe wrote:
>>> Hi,
>>>
>>> This converts the deferred, normal, and fallback task_work to use a
>>> normal io_wq_work_list, rather than an llist.
>>>
>>> The main motivation behind this is to get rid of the need to reverse
>>> the list once it's deleted and run. I tested this basic conversion of
>>> just switching it from an llist to an io_wq_work_list with a spinlock,
>>> and I don't see any benefits from the lockless list. And for cases where
>>> we get a bursty addition of task_work, this approach is faster as it
>>> avoids the need to iterate the list upfront while reversing it.
>>
>> I'm curious how you benchmarked it including accounting of irq/softirq
>> where tw add usually happens?
>
> Performance runs and profiles. I tested send zc with small packets, as
> that is task_work intensive and exhibits the bursty behavior I mentioned
> in the patch / cover letter. And normal storage IO, IRQ driven.
>
> For send zc, we're spending about 2% of the time doing list reversal,
> and I've seen it as high as 5% in other testing. And as that test is CPU
> bound, performance is up about 2% as well.
>
> With the patches, task work adding accounts for about 0.25% of the
> cycles, before it's about 0.66%.
>
> We're spending a bit more time in __io_run_local_work(), but I think
> that's deceptive as we have to disable/enable interrupts now. If an
> interrupt triggers on the unlock, that time tends to be attributed there
> in terms of cycles.

Forgot to mention the storage side - profiles look eerily similar in
terms of time spent in task work adding / running; the only real
difference is that the 1.9% of llist_reverse_order() is gone.
On 3/27/24 16:36, Jens Axboe wrote:
> On 3/27/24 7:33 AM, Pavel Begunkov wrote:
>> On 3/26/24 18:42, Jens Axboe wrote:
>>> Hi,
>>>
>>> This converts the deferred, normal, and fallback task_work to use a
>>> normal io_wq_work_list, rather than an llist.
>>>
>>> The main motivation behind this is to get rid of the need to reverse
>>> the list once it's deleted and run. I tested this basic conversion of
>>> just switching it from an llist to an io_wq_work_list with a spinlock,
>>> and I don't see any benefits from the lockless list. And for cases where
>>> we get a bursty addition of task_work, this approach is faster as it
>>> avoids the need to iterate the list upfront while reversing it.
>>
>> I'm curious how you benchmarked it including accounting of irq/softirq
>> where tw add usually happens?
>
> Performance runs and profiles. I tested send zc with small packets, as
> that is task_work intensive and exhibits the bursty behavior I mentioned
> in the patch / cover letter. And normal storage IO, IRQ driven.

I assume IRQs are firing on random CPUs then, unless you configured it.
In which case it should be bouncing the cacheline, and that peeking at
the prev also needs to fetch it from RAM / further caches. Unless there
is enough time for TCP to batch them.

> For send zc, we're spending about 2% of the time doing list reversal,
> and I've seen it as high as 5% in other testing. And as that test is CPU

I've seen similar before, but for me it was overhead shifted from
__io_run_local_work() fetching requests into the reversal touching all
of them. There should be a change in __io_run_local_work() total cycles
(incl. children) then, I assume.

> bound, performance is up about 2% as well.

Did you count by any chance how many items there were in the list?
Average or so.

> With the patches, task work adding accounts for about 0.25% of the
> cycles, before it's about 0.66%.

I.e. the spinlock is faster. How come? It's the same cmpxchg inside the
spinlock, often with cache misses, but with irq on/off on top. The only
diff I can remember is that peek into the prev req.

> We're spending a bit more time in __io_run_local_work(), but I think
> that's deceptive as we have to disable/enable interrupts now. If an
> interrupt triggers on the unlock, that time tends to be attributed there
> in terms of cycles.

Hmm, I think if the run_local_work runtime doesn't change you'd
statistically get the same number of interrupt "items" hitting it, but
they would be condensed more into the irq-off section. Or are you
accounting for some irq delivery / hw differences?

>> One known problem with the current list approach I mentioned several
>> times before is that it peeks at the previous queued tw to count them.
>> It's not nice, but that can be easily done with cmpxchg double. I
>> wonder how much of an issue that is.
>
> That's more of a wart than a real issue though, but this approach

Assuming tw adds execute on random CPUs, that would be an additional
fetch on every tw add; I wouldn't disregard it right away.

> obviously doesn't do that. And then we can drop the rcu section around
> adding local task_work. Not a huge deal, but still nice.

I don't see how. After you queue the req it might be immediately
executed, dropping the ctx ref, which was previously protecting ctx.
The rcu section protects ctx.

>>> And this is less code and simpler, so I'd prefer to go that route.
>>
>> I'm not sure it's less code; if you bring back the optimisations that I
>> believe were killed (see comments to patch 2), it might turn out to
>> be even bulkier and not that much simpler.
>
> It's still considerably less:
>
>  3 files changed, 59 insertions(+), 84 deletions(-)
>
> though that isn't conclusive by itself, as e.g. io_llist_xchg() goes away,
> which has some comments. But I do think the resulting code is simpler
> and more straightforward.
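To make the "peeks at the previous queued tw to count them" point concrete, here is an editorial userspace sketch of that pattern; it is not the actual io_uring code, and `counted_node`, `counted_head`, and `nr_pending` are made-up names. The producer dereferences the current head purely to derive a running count, which is the extra (potentially cross-CPU) fetch being discussed above.

```c
/*
 * Editorial sketch, not kernel code: a lockless LIFO where each node
 * carries a running count of pending entries, derived by peeking at the
 * node that was previously at the head.
 */
#include <stdatomic.h>
#include <stddef.h>

struct counted_node {
	struct counted_node *next;
	unsigned int nr_pending;	/* entries pending, including this one */
};

static _Atomic(struct counted_node *) counted_head;

/* Returns the number of entries pending after this add. */
static unsigned int counted_add(struct counted_node *node)
{
	struct counted_node *old = atomic_load(&counted_head);

	do {
		/*
		 * The "peek": touching the previous node just to count.
		 * A real implementation must guarantee that 'old' is still
		 * safe to dereference at this point.
		 */
		node->nr_pending = (old ? old->nr_pending : 0) + 1;
		node->next = old;
	} while (!atomic_compare_exchange_weak(&counted_head, &old, node));

	return node->nr_pending;
}
```

The suggested alternative, presumably keeping the count packed next to the head pointer and updating both with a double-width cmpxchg, would avoid dereferencing the previous entry at the cost of a wider atomic. A spinlocked FIFO sidesteps the question entirely, since any counter can simply be updated under the same lock.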