Message ID | 20240118180541.930783-2-axboe@kernel.dk (mailing list archive) |
---|---|
State | New, archived |
Series | mq-deadline scalability improvements |
On 1/18/24 10:04, Jens Axboe wrote:
> If we're entering request dispatch but someone else is already
> dispatching, then just skip this dispatch. We know IO is inflight and
> this will trigger another dispatch event for any completion. This will
> potentially cause slightly lower queue depth for contended cases, but
> those are slowed down anyway and this should not cause an issue.

Shouldn't a performance optimization patch include numbers that show by
how much that patch improves the performance of different workloads?

>  struct deadline_data {
>  	/*
>  	 * run time data
>  	 */
> +	struct {
> +		spinlock_t lock;
> +		spinlock_t zone_lock;
> +	} ____cacheline_aligned_in_smp;

Is ____cacheline_aligned_in_smp useful here? struct deadline_data is not
embedded in any other data structure but is allocated with
kzalloc_node().

> +	/*
> +	 * If someone else is already dispatching, skip this one. This will
> +	 * defer the next dispatch event to when something completes, and could
> +	 * potentially lower the queue depth for contended cases.
> +	 */
> +	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
> +	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
> +		return NULL;
> +

The above code behaves similarly to spin_trylock(). Has it been
considered to use spin_trylock() instead?

Thanks,

Bart.
On 1/18/24 11:24 AM, Bart Van Assche wrote:
> On 1/18/24 10:04, Jens Axboe wrote:
>> If we're entering request dispatch but someone else is already
>> dispatching, then just skip this dispatch. We know IO is inflight and
>> this will trigger another dispatch event for any completion. This will
>> potentially cause slightly lower queue depth for contended cases, but
>> those are slowed down anyway and this should not cause an issue.
>
> Shouldn't a performance optimization patch include numbers that show by
> how much that patch improves the performance of different workloads?

The commit messages may want some editing, but all of that is in the
cover letter that I'm sure you saw. This is just a POC/RFC posting.

>>  struct deadline_data {
>>  	/*
>>  	 * run time data
>>  	 */
>> +	struct {
>> +		spinlock_t lock;
>> +		spinlock_t zone_lock;
>> +	} ____cacheline_aligned_in_smp;
>
> Is ____cacheline_aligned_in_smp useful here? struct deadline_data is
> not embedded in any other data structure but is allocated with
> kzalloc_node().

It's not for alignment of deadline_data, it's for alignment within
deadline_data, so that grabbing/releasing these locks doesn't share a
cacheline with other bits. We could ensure that deadline_data itself is
aligned as well; that might be a good idea.

>> +	/*
>> +	 * If someone else is already dispatching, skip this one. This will
>> +	 * defer the next dispatch event to when something completes, and could
>> +	 * potentially lower the queue depth for contended cases.
>> +	 */
>> +	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
>> +	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
>> +		return NULL;
>> +
>
> The above code behaves similarly to spin_trylock(). Has it been
> considered to use spin_trylock() instead?

Do you read the replies to the emails, from the other thread? First of
all, you'd need another lock for this. And secondly, this avoids a RMW
if it's already set. So no, I think this is cleaner than a separate
lock, and you'd still have cacheline bouncing on that lock for ALL calls
if you did it that way.
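For context on the marker itself: ____cacheline_aligned_in_smp is defined
in include/linux/cache.h roughly as follows (paraphrased; the exact
guards vary a little between kernel versions), so it is purely an
alignment attribute inside the containing object and says nothing about
how that object is allocated:

#ifdef CONFIG_SMP
/* Align the annotated member to the L1 cacheline size on SMP builds. */
#define ____cacheline_aligned_in_smp \
        __attribute__((__aligned__(SMP_CACHE_BYTES)))
#else
/* No-op on uniprocessor builds, where false sharing cannot occur. */
#define ____cacheline_aligned_in_smp
#endif

That matches the explanation above: the goal is to keep the two contended
locks on their own cacheline inside deadline_data, independent of how
deadline_data itself is allocated or aligned.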
On 1/18/24 10:45, Jens Axboe wrote:
> Do you read the replies to the emails, from the other thread?

Yes.

> And secondly, this avoids a RMW if it's already set.

From the spinlock implementation (something I looked up before I wrote
my previous reply):

static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
        int val = atomic_read(&lock->val);

        if (unlikely(val))
                return 0;

        return likely(atomic_try_cmpxchg_acquire(&lock->val, &val,
                                                  _Q_LOCKED_VAL));
}

I think this implementation does not perform a RMW if the spinlock is
already locked.

Thanks,

Bart.
On 1/18/24 11:51 AM, Bart Van Assche wrote:
> On 1/18/24 10:45, Jens Axboe wrote:
>> Do you read the replies to the emails, from the other thread?
>
> Yes.

Well it's been a bit frustrating to say the least, as you seem to either
not read them or just ignore what's in them.

>> And secondly, this avoids a RMW if it's already set.
>
> From the spinlock implementation (something I looked up before I wrote
> my previous reply):
>
> static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> {
>         int val = atomic_read(&lock->val);
>
>         if (unlikely(val))
>                 return 0;
>
>         return likely(atomic_try_cmpxchg_acquire(&lock->val, &val,
>                                                   _Q_LOCKED_VAL));
> }
>
> I think this implementation does not perform a RMW if the spinlock is
> already locked.

This looks like a generic trylock; is the same true for the arch
variants? Because your attempt either failed because of RMW or because
you are hammering it repeatedly, or both. In any case, even if any
trylock would avoid the initial atomic, it doesn't give you anything
that a bitop would not just do, and we use the latter trick elsewhere in
blk-mq.
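The idiom Jens is defending is the usual "test, then test-and-set"
pattern; a minimal sketch with a made-up helper name follows (the patch
open-codes this directly in dd_dispatch_request()):

/*
 * Illustrative only.  The plain test_bit() is a read, so CPUs that lose
 * the race only take a shared cacheline access; the locked RMW in
 * test_and_set_bit() is attempted only when the bit appears clear, and
 * exactly one caller sees it return 0 and wins the dispatch.
 */
static bool dd_try_claim_dispatch(unsigned long *run_state)
{
        if (test_bit(DD_DISPATCHING, run_state))
                return false;
        return !test_and_set_bit(DD_DISPATCHING, run_state);
}

The winner is then responsible for clearing the bit again, which the
patch does with clear_bit() in the unlock path of dd_dispatch_request(),
so that completions can trigger the next dispatch.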
On Thu, Jan 18, 2024 at 11:04:56AM -0700, Jens Axboe wrote:
> If we're entering request dispatch but someone else is already
> dispatching, then just skip this dispatch. We know IO is inflight and
> this will trigger another dispatch event for any completion. This will
> potentially cause slightly lower queue depth for contended cases, but
> those are slowed down anyway and this should not cause an issue.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/mq-deadline.c | 28 +++++++++++++++++++++++-----
>  1 file changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index f958e79277b8..9e0ab3ea728a 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -79,10 +79,20 @@ struct dd_per_prio {
>  	struct io_stats_per_prio stats;
>  };
>
> +enum {
> +	DD_DISPATCHING = 0,
> +};
> +
>  struct deadline_data {
>  	/*
>  	 * run time data
>  	 */
> +	struct {
> +		spinlock_t lock;
> +		spinlock_t zone_lock;
> +	} ____cacheline_aligned_in_smp;
> +
> +	unsigned long run_state;
>
>  	struct dd_per_prio per_prio[DD_PRIO_COUNT];
>
> @@ -100,9 +110,6 @@ struct deadline_data {
>  	int front_merges;
>  	u32 async_depth;
>  	int prio_aging_expire;
> -
> -	spinlock_t lock;
> -	spinlock_t zone_lock;
>  };
>
>  /* Maps an I/O priority class to a deadline scheduler priority. */
> @@ -600,6 +607,15 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  	struct request *rq;
>  	enum dd_prio prio;
>
> +	/*
> +	 * If someone else is already dispatching, skip this one. This will
> +	 * defer the next dispatch event to when something completes, and could
> +	 * potentially lower the queue depth for contended cases.
> +	 */
> +	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
> +	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
> +		return NULL;
> +

This patch looks fine.

BTW, the current dispatch actually piggybacks on the in-progress
dispatch, see blk_mq_do_dispatch_sched(). And the correctness depends on
the looping dispatch & retry in blk_mq_do_dispatch_sched() when there is
nothing to dispatch; maybe we need to document that here.

Thanks,
Ming
On 1/18/24 7:40 PM, Ming Lei wrote:
> On Thu, Jan 18, 2024 at 11:04:56AM -0700, Jens Axboe wrote:
>> If we're entering request dispatch but someone else is already
>> dispatching, then just skip this dispatch. We know IO is inflight and
>> this will trigger another dispatch event for any completion. This will
>> potentially cause slightly lower queue depth for contended cases, but
>> those are slowed down anyway and this should not cause an issue.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  block/mq-deadline.c | 28 +++++++++++++++++++++++-----
>>  1 file changed, 23 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>> index f958e79277b8..9e0ab3ea728a 100644
>> --- a/block/mq-deadline.c
>> +++ b/block/mq-deadline.c
>> @@ -79,10 +79,20 @@ struct dd_per_prio {
>>  	struct io_stats_per_prio stats;
>>  };
>>
>> +enum {
>> +	DD_DISPATCHING = 0,
>> +};
>> +
>>  struct deadline_data {
>>  	/*
>>  	 * run time data
>>  	 */
>> +	struct {
>> +		spinlock_t lock;
>> +		spinlock_t zone_lock;
>> +	} ____cacheline_aligned_in_smp;
>> +
>> +	unsigned long run_state;
>>
>>  	struct dd_per_prio per_prio[DD_PRIO_COUNT];
>>
>> @@ -100,9 +110,6 @@ struct deadline_data {
>>  	int front_merges;
>>  	u32 async_depth;
>>  	int prio_aging_expire;
>> -
>> -	spinlock_t lock;
>> -	spinlock_t zone_lock;
>>  };
>>
>>  /* Maps an I/O priority class to a deadline scheduler priority. */
>> @@ -600,6 +607,15 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>>  	struct request *rq;
>>  	enum dd_prio prio;
>>
>> +	/*
>> +	 * If someone else is already dispatching, skip this one. This will
>> +	 * defer the next dispatch event to when something completes, and could
>> +	 * potentially lower the queue depth for contended cases.
>> +	 */
>> +	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
>> +	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
>> +		return NULL;
>> +
>
> This patch looks fine.
>
> BTW, the current dispatch actually piggybacks on the in-progress
> dispatch, see blk_mq_do_dispatch_sched(). And the correctness depends
> on the looping dispatch & retry in blk_mq_do_dispatch_sched() when
> there is nothing to dispatch; maybe we need to document that here.

Thanks for taking a look, I'll add a comment.
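One possible wording for that comment, extending the one already in the
patch (a suggestion based on Ming's remark, not the final upstream text):

        /*
         * If someone else is already dispatching, skip this one. This will
         * defer the next dispatch event to when something completes, and
         * could potentially lower the queue depth for contended cases.
         *
         * See the logic in blk_mq_do_dispatch_sched(): if nothing is
         * dispatched here while IO is still inflight, the queue gets run
         * again on completion, so a skipped dispatch is deferred rather
         * than lost.
         */
        if (test_bit(DD_DISPATCHING, &dd->run_state) ||
            test_and_set_bit(DD_DISPATCHING, &dd->run_state))
                return NULL;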
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..9e0ab3ea728a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -79,10 +79,20 @@ struct dd_per_prio {
 	struct io_stats_per_prio stats;
 };
 
+enum {
+	DD_DISPATCHING = 0,
+};
+
 struct deadline_data {
 	/*
 	 * run time data
 	 */
+	struct {
+		spinlock_t lock;
+		spinlock_t zone_lock;
+	} ____cacheline_aligned_in_smp;
+
+	unsigned long run_state;
 
 	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
@@ -100,9 +110,6 @@ struct deadline_data {
 	int front_merges;
 	u32 async_depth;
 	int prio_aging_expire;
-
-	spinlock_t lock;
-	spinlock_t zone_lock;
 };
 
 /* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +607,15 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	struct request *rq;
 	enum dd_prio prio;
 
+	/*
+	 * If someone else is already dispatching, skip this one. This will
+	 * defer the next dispatch event to when something completes, and could
+	 * potentially lower the queue depth for contended cases.
+	 */
+	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
+	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
+		return NULL;
+
 	spin_lock(&dd->lock);
 	rq = dd_dispatch_prio_aged_requests(dd, now);
 	if (rq)
@@ -616,6 +632,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	}
 
 unlock:
+	clear_bit(DD_DISPATCHING, &dd->run_state);
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -706,6 +723,9 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	eq->elevator_data = dd;
 
+	spin_lock_init(&dd->lock);
+	spin_lock_init(&dd->zone_lock);
+
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
@@ -722,8 +742,6 @@ static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->last_dir = DD_WRITE;
 	dd->fifo_batch = fifo_batch;
 	dd->prio_aging_expire = prio_aging_expire;
-	spin_lock_init(&dd->lock);
-	spin_lock_init(&dd->zone_lock);
 
 	/* We dispatch from request queue wide instead of hw queue */
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);
If we're entering request dispatch but someone else is already
dispatching, then just skip this dispatch. We know IO is inflight and
this will trigger another dispatch event for any completion. This will
potentially cause slightly lower queue depth for contended cases, but
those are slowed down anyway and this should not cause an issue.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/mq-deadline.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)