Message ID | 20240123174021.1967461-3-axboe@kernel.dk (mailing list archive)
---|---
State | New, archived
Series | [1/8] block/mq-deadline: pass in queue directly to dd_insert_request()
On 1/23/24 09:34, Jens Axboe wrote:
> +	struct {
> +		spinlock_t lock;
> +		spinlock_t zone_lock;
> +	} ____cacheline_aligned_in_smp;

It is not clear to me why the ____cacheline_aligned_in_smp attribute
is applied to the two spinlocks combined? Can this cause both spinlocks
to end up in the same cache line? If the ____cacheline_aligned_in_smp
attribute would be applied to each spinlock separately, could that
improve performance even further? Otherwise this patch looks good to me,
hence:

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
On 1/23/24 11:36 AM, Bart Van Assche wrote:
> On 1/23/24 09:34, Jens Axboe wrote:
>> +	struct {
>> +		spinlock_t lock;
>> +		spinlock_t zone_lock;
>> +	} ____cacheline_aligned_in_smp;
>
> It is not clear to me why the ____cacheline_aligned_in_smp attribute
> is applied to the two spinlocks combined? Can this cause both spinlocks
> to end up in the same cache line? If the ____cacheline_aligned_in_smp
> attribute would be applied to each spinlock separately, could that
> improve performance even further? Otherwise this patch looks good to me,
> hence:

It is somewhat counterintuitive, but my testing shows that there's no
problem with them in the same cacheline. Hence I'm reluctant to move
them out of the struct and align both of them, as it'd just waste memory
for seemingly no runtime benefit.

> Reviewed-by: Bart Van Assche <bvanassche@acm.org>

Thanks!
Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
On Tue, Jan 23, 2024 at 12:13:29PM -0700, Jens Axboe wrote:
> On 1/23/24 11:36 AM, Bart Van Assche wrote:
> > On 1/23/24 09:34, Jens Axboe wrote:
> >> +	struct {
> >> +		spinlock_t lock;
> >> +		spinlock_t zone_lock;
> >> +	} ____cacheline_aligned_in_smp;
> >
> > It is not clear to me why the ____cacheline_aligned_in_smp attribute
> > is applied to the two spinlocks combined? Can this cause both spinlocks
> > to end up in the same cache line? If the ____cacheline_aligned_in_smp
> > attribute would be applied to each spinlock separately, could that
> > improve performance even further? Otherwise this patch looks good to me,
> > hence:
>
> It is somewhat counterintuitive, but my testing shows that there's no
> problem with them in the same cacheline. Hence I'm reluctant to move
> them out of the struct and align both of them, as it'd just waste memory
> for seemingly no runtime benefit.

Is there any benefit in aligning either of them? The whole cache line
align locks thing seemed to have been very popular 20 years ago,
and these days it tends to not make much of a difference.
On 1/24/24 2:31 AM, Christoph Hellwig wrote:
> On Tue, Jan 23, 2024 at 12:13:29PM -0700, Jens Axboe wrote:
>> On 1/23/24 11:36 AM, Bart Van Assche wrote:
>>> On 1/23/24 09:34, Jens Axboe wrote:
>>>> +	struct {
>>>> +		spinlock_t lock;
>>>> +		spinlock_t zone_lock;
>>>> +	} ____cacheline_aligned_in_smp;
>>>
>>> It is not clear to me why the ____cacheline_aligned_in_smp attribute
>>> is applied to the two spinlocks combined? Can this cause both spinlocks
>>> to end up in the same cache line? If the ____cacheline_aligned_in_smp
>>> attribute would be applied to each spinlock separately, could that
>>> improve performance even further? Otherwise this patch looks good to me,
>>> hence:
>>
>> It is somewhat counterintuitive, but my testing shows that there's no
>> problem with them in the same cacheline. Hence I'm reluctant to move
>> them out of the struct and align both of them, as it'd just waste memory
>> for seemingly no runtime benefit.
>
> Is there any benefit in aligning either of them? The whole cache line
> align locks thing seemed to have been very popular 20 years ago,
> and these days it tends to not make much of a difference.

It's about 1% for me, so it does make a difference. Probably because it
shares with run_state otherwise. And I'll have to violently disagree
that aligning frequently used locks outside of other modified or
mostly-read regions was a fad from 20 years ago, it still very much
makes sense. It may be overdone, like any "trick" like that, but it's
definitely not useless.
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 9b7563e9d638..79bc3b6784b3 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -79,10 +79,20 @@ struct dd_per_prio {
 	struct io_stats_per_prio stats;
 };
 
+enum {
+	DD_DISPATCHING	= 0,
+};
+
 struct deadline_data {
 	/*
 	 * run time data
 	 */
+	struct {
+		spinlock_t lock;
+		spinlock_t zone_lock;
+	} ____cacheline_aligned_in_smp;
+
+	unsigned long run_state;
 
 	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
@@ -100,9 +110,6 @@ struct deadline_data {
 	int front_merges;
 	u32 async_depth;
 	int prio_aging_expire;
-
-	spinlock_t lock;
-	spinlock_t zone_lock;
 };
 
 /* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +607,18 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	struct request *rq;
 	enum dd_prio prio;
 
+	/*
+	 * If someone else is already dispatching, skip this one. This will
+	 * defer the next dispatch event to when something completes, and could
+	 * potentially lower the queue depth for contended cases.
+	 *
+	 * See the logic in blk_mq_do_dispatch_sched(), which loops and
+	 * retries if nothing is dispatched.
+	 */
+	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
+	    test_and_set_bit_lock(DD_DISPATCHING, &dd->run_state))
+		return NULL;
+
 	spin_lock(&dd->lock);
 	rq = dd_dispatch_prio_aged_requests(dd, now);
 	if (rq)
@@ -616,6 +635,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	}
 
 unlock:
+	clear_bit_unlock(DD_DISPATCHING, &dd->run_state);
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -706,6 +726,9 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	eq->elevator_data = dd;
 
+	spin_lock_init(&dd->lock);
+	spin_lock_init(&dd->zone_lock);
+
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
@@ -722,8 +745,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->last_dir = DD_WRITE;
 	dd->fifo_batch = fifo_batch;
 	dd->prio_aging_expire = prio_aging_expire;
-	spin_lock_init(&dd->lock);
-	spin_lock_init(&dd->zone_lock);
 
 	/* We dispatch from request queue wide instead of hw queue */
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
If we're entering request dispatch but someone else is already
dispatching, then just skip this dispatch. We know IO is inflight and
this will trigger another dispatch event for any completion. This will
potentially cause slightly lower queue depth for contended cases, but
those are slowed down anyway and this should not cause an issue.

By itself, this patch doesn't help a whole lot, as the dispatch lock
contention reduction is just eaten up by the same dd->lock now seeing
increased insertion contention. But it's required work to be able to
reduce the lock contention in general.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/mq-deadline.c | 31 ++++++++++++++++++++++++++-----
 1 file changed, 26 insertions(+), 5 deletions(-)