
[1/2] block/mq-deadline: serialize request dispatching

Message ID 20240118180541.930783-2-axboe@kernel.dk (mailing list archive)
State New, archived
Series mq-deadline scalability improvements

Commit Message

Jens Axboe Jan. 18, 2024, 6:04 p.m. UTC
If we're entering request dispatch but someone else is already
dispatching, then just skip this dispatch. We know IO is inflight and
this will trigger another dispatch event for any completion. This will
potentially cause slightly lower queue depth for contended cases, but
those are slowed down anyway and this should not cause an issue.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/mq-deadline.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

Comments

Bart Van Assche Jan. 18, 2024, 6:24 p.m. UTC | #1
On 1/18/24 10:04, Jens Axboe wrote:
> If we're entering request dispatch but someone else is already
> dispatching, then just skip this dispatch. We know IO is inflight and
> this will trigger another dispatch event for any completion. This will
> potentially cause slightly lower queue depth for contended cases, but
> those are slowed down anyway and this should not cause an issue.

Shouldn't a performance optimization patch include numbers that show by
how much that patch improves the performance of different workloads?

>   struct deadline_data {
>   	/*
>   	 * run time data
>   	 */
> +	struct {
> +		spinlock_t lock;
> +		spinlock_t zone_lock;
> +	} ____cacheline_aligned_in_smp;

Is ____cacheline_aligned_in_smp useful here? struct deadline_data
is not embedded in any other data structure but is allocated with 
kzalloc_node().

> +	/*
> +	 * If someone else is already dispatching, skip this one. This will
> +	 * defer the next dispatch event to when something completes, and could
> +	 * potentially lower the queue depth for contended cases.
> +	 */
> +	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
> +	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
> +		return NULL;
> +

The above code behaves similarly to spin_trylock(). Has it been considered
to use spin_trylock() instead?

Thanks,

Bart.
Jens Axboe Jan. 18, 2024, 6:45 p.m. UTC | #2
On 1/18/24 11:24 AM, Bart Van Assche wrote:
> On 1/18/24 10:04, Jens Axboe wrote:
>> If we're entering request dispatch but someone else is already
>> dispatching, then just skip this dispatch. We know IO is inflight and
>> this will trigger another dispatch event for any completion. This will
>> potentially cause slightly lower queue depth for contended cases, but
>> those are slowed down anyway and this should not cause an issue.
> 
> Shouldn't a performance optimization patch include numbers that show by
> how much that patch improves the performance of different workloads?

The commit messages may want some editing, but all of that is in the
cover letter that I'm sure you saw. This is just a POC/RFC posting.

>>   struct deadline_data {
>>       /*
>>        * run time data
>>        */
>> +    struct {
>> +        spinlock_t lock;
>> +        spinlock_t zone_lock;
>> +    } ____cacheline_aligned_in_smp;
> 
> Is ____cacheline_aligned_in_smp useful here? struct deadline_data
> is not embedded in any other data structure but is allocated with kzalloc_node().

It's not for alignment of deadline_data, it's for alignment within
deadline_data, so that grabbing/releasing these locks doesn't share a
cacheline with other bits. We could ensure that deadline_data itself
is aligned as well; that might be a good idea.
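
To make the layout point concrete, here is a minimal sketch (struct and field
names are illustrative, not taken from the patch) of how
____cacheline_aligned_in_smp on an anonymous struct puts the lock group on its
own cacheline boundary inside a containing structure:

#include <linux/cache.h>
#include <linux/spinlock.h>

struct example_sched_data {
	/*
	 * The annotation aligns the start of this anonymous struct to an
	 * SMP cacheline boundary, so the frequently written lock words do
	 * not share a line with whatever fields precede them.
	 */
	struct {
		spinlock_t lock;
		spinlock_t zone_lock;
	} ____cacheline_aligned_in_smp;

	/* Remaining fields follow the aligned lock group. */
	int fifo_batch;
	int front_merges;
};

As noted above, this only controls offsets within the struct; whether the
group lands on a real cacheline boundary at runtime also depends on how the
deadline_data allocation itself is aligned.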

>> +    /*
>> +     * If someone else is already dispatching, skip this one. This will
>> +     * defer the next dispatch event to when something completes, and could
>> +     * potentially lower the queue depth for contended cases.
>> +     */
>> +    if (test_bit(DD_DISPATCHING, &dd->run_state) ||
>> +        test_and_set_bit(DD_DISPATCHING, &dd->run_state))
>> +        return NULL;
>> +
> 
> The above code behaves similarly to spin_trylock(). Has it been
> considered to use spin_trylock() instead?

Do you read the replies to the emails, from the other thread? First of
all, you'd need another lock for this. And secondly, this avoids a RMW
if it's already set. So no, I think this is cleaner than a separate
lock, and you'd still have cacheline bouncing on that for ALL calls if
you did it that way.
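
Condensed, the guard being defended here looks roughly like this (the helper
name is illustrative, the bit/word layout mirrors the patch): the leading
test_bit() is a plain load that keeps the cacheline shared for every loser of
the race, and the locked test_and_set_bit() RMW only happens when the bit was
observed clear.

#include <linux/bitops.h>

/*
 * Try to become the single dispatcher. Returns true if we now own the
 * bit and must clear_bit() it when done, false if someone else already
 * holds it and this dispatch should simply be skipped.
 */
static bool dd_claim_dispatch(unsigned long *run_state, int bit)
{
	if (test_bit(bit, run_state))			/* plain read, no RMW */
		return false;
	return !test_and_set_bit(bit, run_state);	/* RMW only if it looked clear */
}

A trylock-based guard would need a dedicated spinlock member for the same job,
which is the separate lock being argued against here.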
Bart Van Assche Jan. 18, 2024, 6:51 p.m. UTC | #3
On 1/18/24 10:45, Jens Axboe wrote:
> Do you read the replies to the emails, from the other thread?

Yes.
> And secondly, this avoids a RMW if it's already set.
From the spinlock implementation (something I looked up before I wrote
my previous reply):

static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
	int val = atomic_read(&lock->val);

	if (unlikely(val))
		return 0;

	return likely(atomic_try_cmpxchg_acquire(&lock->val, &val,
                                                  _Q_LOCKED_VAL));
}

I think this implementation does not perform a RMW if the spinlock is
already locked.

Thanks,

Bart.
Jens Axboe Jan. 18, 2024, 6:55 p.m. UTC | #4
On 1/18/24 11:51 AM, Bart Van Assche wrote:
> On 1/18/24 10:45, Jens Axboe wrote:
>> Do you read the replies to the emails, from the other thread?
> 
> Yes.

Well it's been a bit frustrating to say the least, as you seem to either
not read them or just ignore what's in them.

>> And secondly, this avoids a RMW if it's already set.
> From the spinlock implementation (something I looked up before I wrote
> my previous reply):
> 
> static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> {
>     int val = atomic_read(&lock->val);
> 
>     if (unlikely(val))
>         return 0;
> 
>     return likely(atomic_try_cmpxchg_acquire(&lock->val, &val,
>                                                  _Q_LOCKED_VAL));
> }
> 
> I think this implementation does not perform a RMW if the spinlock is
> already locked.

This looks like a generic trylock; is the same true for the arch
variants? Because your attempt either failed because of the RMW, or
because you are hammering it repeatedly, or both.

In any case, even if every trylock variant avoids the initial atomic, it
doesn't give you anything that a bitop wouldn't, and we use the
latter trick elsewhere in blk-mq.
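
One example of that trick in blk-mq, paraphrased from memory of
block/blk-mq-sched.c (the exact form in the tree may differ), is the scheduler
restart marking; here the second step is a plain set_bit() because exclusive
ownership isn't needed, but the cheap test-before-atomic shape is the same:

void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
{
	/* Plain read first; skip the atomic RMW if the bit is already set. */
	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		return;

	set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
}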
Ming Lei Jan. 19, 2024, 2:40 a.m. UTC | #5
On Thu, Jan 18, 2024 at 11:04:56AM -0700, Jens Axboe wrote:
> If we're entering request dispatch but someone else is already
> dispatching, then just skip this dispatch. We know IO is inflight and
> this will trigger another dispatch event for any completion. This will
> potentially cause slightly lower queue depth for contended cases, but
> those are slowed down anyway and this should not cause an issue.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/mq-deadline.c | 28 +++++++++++++++++++++++-----
>  1 file changed, 23 insertions(+), 5 deletions(-)
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index f958e79277b8..9e0ab3ea728a 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -79,10 +79,20 @@ struct dd_per_prio {
>  	struct io_stats_per_prio stats;
>  };
>  
> +enum {
> +	DD_DISPATCHING	= 0,
> +};
> +
>  struct deadline_data {
>  	/*
>  	 * run time data
>  	 */
> +	struct {
> +		spinlock_t lock;
> +		spinlock_t zone_lock;
> +	} ____cacheline_aligned_in_smp;
> +
> +	unsigned long run_state;
>  
>  	struct dd_per_prio per_prio[DD_PRIO_COUNT];
>  
> @@ -100,9 +110,6 @@ struct deadline_data {
>  	int front_merges;
>  	u32 async_depth;
>  	int prio_aging_expire;
> -
> -	spinlock_t lock;
> -	spinlock_t zone_lock;
>  };
>  
>  /* Maps an I/O priority class to a deadline scheduler priority. */
> @@ -600,6 +607,15 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>  	struct request *rq;
>  	enum dd_prio prio;
>  
> +	/*
> +	 * If someone else is already dispatching, skip this one. This will
> +	 * defer the next dispatch event to when something completes, and could
> +	 * potentially lower the queue depth for contended cases.
> +	 */
> +	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
> +	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
> +		return NULL;
> +

This patch looks fine.

BTW, the current dispatch actually piggybacks on the in-progress dispatch;
see blk_mq_do_dispatch_sched(). And correctness depends on
blk_mq_do_dispatch_sched() looping and retrying when nothing was
dispatched, so maybe we need to document that here.


Thanks,
Ming
Jens Axboe Jan. 19, 2024, 3:49 p.m. UTC | #6
On 1/18/24 7:40 PM, Ming Lei wrote:
> On Thu, Jan 18, 2024 at 11:04:56AM -0700, Jens Axboe wrote:
>> If we're entering request dispatch but someone else is already
>> dispatching, then just skip this dispatch. We know IO is inflight and
>> this will trigger another dispatch event for any completion. This will
>> potentially cause slightly lower queue depth for contended cases, but
>> those are slowed down anyway and this should not cause an issue.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  block/mq-deadline.c | 28 +++++++++++++++++++++++-----
>>  1 file changed, 23 insertions(+), 5 deletions(-)
>>
>> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
>> index f958e79277b8..9e0ab3ea728a 100644
>> --- a/block/mq-deadline.c
>> +++ b/block/mq-deadline.c
>> @@ -79,10 +79,20 @@ struct dd_per_prio {
>>  	struct io_stats_per_prio stats;
>>  };
>>  
>> +enum {
>> +	DD_DISPATCHING	= 0,
>> +};
>> +
>>  struct deadline_data {
>>  	/*
>>  	 * run time data
>>  	 */
>> +	struct {
>> +		spinlock_t lock;
>> +		spinlock_t zone_lock;
>> +	} ____cacheline_aligned_in_smp;
>> +
>> +	unsigned long run_state;
>>  
>>  	struct dd_per_prio per_prio[DD_PRIO_COUNT];
>>  
>> @@ -100,9 +110,6 @@ struct deadline_data {
>>  	int front_merges;
>>  	u32 async_depth;
>>  	int prio_aging_expire;
>> -
>> -	spinlock_t lock;
>> -	spinlock_t zone_lock;
>>  };
>>  
>>  /* Maps an I/O priority class to a deadline scheduler priority. */
>> @@ -600,6 +607,15 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>>  	struct request *rq;
>>  	enum dd_prio prio;
>>  
>> +	/*
>> +	 * If someone else is already dispatching, skip this one. This will
>> +	 * defer the next dispatch event to when something completes, and could
>> +	 * potentially lower the queue depth for contended cases.
>> +	 */
>> +	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
>> +	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
>> +		return NULL;
>> +
> 
> This patch looks fine.
> 
> BTW, the current dispatch actually piggybacks on the in-progress dispatch;
> see blk_mq_do_dispatch_sched(). And correctness depends on
> blk_mq_do_dispatch_sched() looping and retrying when nothing was
> dispatched, so maybe we need to document that here.

Thanks for taking a look, I'll add a comment.
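
For illustration, such a comment could look roughly like this (the wording is
a sketch based on Ming's remark, not taken from any later revision of the
patch), extending the block above the test_bit()/test_and_set_bit() check:

	/*
	 * If someone else is already dispatching, skip this one. This will
	 * defer the next dispatch event to when something completes, and
	 * could potentially lower the queue depth for contended cases.
	 *
	 * Correctness relies on the blk_mq_do_dispatch_sched() loop: the
	 * task that currently owns DD_DISPATCHING keeps dispatching on
	 * everyone's behalf, and anything skipped here is picked up either
	 * by that loop retrying or by the queue re-run triggered when
	 * in-flight IO completes.
	 */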

Patch

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..9e0ab3ea728a 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -79,10 +79,20 @@  struct dd_per_prio {
 	struct io_stats_per_prio stats;
 };
 
+enum {
+	DD_DISPATCHING	= 0,
+};
+
 struct deadline_data {
 	/*
 	 * run time data
 	 */
+	struct {
+		spinlock_t lock;
+		spinlock_t zone_lock;
+	} ____cacheline_aligned_in_smp;
+
+	unsigned long run_state;
 
 	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
@@ -100,9 +110,6 @@  struct deadline_data {
 	int front_merges;
 	u32 async_depth;
 	int prio_aging_expire;
-
-	spinlock_t lock;
-	spinlock_t zone_lock;
 };
 
 /* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +607,15 @@  static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	struct request *rq;
 	enum dd_prio prio;
 
+	/*
+	 * If someone else is already dispatching, skip this one. This will
+	 * defer the next dispatch event to when something completes, and could
+	 * potentially lower the queue depth for contended cases.
+	 */
+	if (test_bit(DD_DISPATCHING, &dd->run_state) ||
+	    test_and_set_bit(DD_DISPATCHING, &dd->run_state))
+		return NULL;
+
 	spin_lock(&dd->lock);
 	rq = dd_dispatch_prio_aged_requests(dd, now);
 	if (rq)
@@ -616,6 +632,7 @@  static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	}
 
 unlock:
+	clear_bit(DD_DISPATCHING, &dd->run_state);
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -706,6 +723,9 @@  static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	eq->elevator_data = dd;
 
+	spin_lock_init(&dd->lock);
+	spin_lock_init(&dd->zone_lock);
+
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
 
@@ -722,8 +742,6 @@  static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	dd->last_dir = DD_WRITE;
 	dd->fifo_batch = fifo_batch;
 	dd->prio_aging_expire = prio_aging_expire;
-	spin_lock_init(&dd->lock);
-	spin_lock_init(&dd->zone_lock);
 
 	/* We dispatch from request queue wide instead of hw queue */
 	blk_queue_flag_set(QUEUE_FLAG_SQ_SCHED, q);