[2/2] block/mq-deadline: fallback to per-cpu insertion buckets under contention

Message ID 20240118180541.930783-3-axboe@kernel.dk (mailing list archive)
State New, archived
Series mq-deadline scalability improvements

Commit Message

Jens Axboe Jan. 18, 2024, 6:04 p.m. UTC
If we attempt to insert a list of requests but someone else is already
running an insertion, then fall back to queueing it internally and let
the existing inserter finish the operation.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 block/mq-deadline.c | 118 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 110 insertions(+), 8 deletions(-)

Comments

Keith Busch Jan. 18, 2024, 6:25 p.m. UTC | #1
On Thu, Jan 18, 2024 at 11:04:57AM -0700, Jens Axboe wrote:
> +#define DD_CPU_BUCKETS		32
> +#define DD_CPU_BUCKETS_MASK	(DD_CPU_BUCKETS - 1)

A bit of wasted space on machines with < 32 CPUs, no? Sure, it's not
much and the fixed size makes the implementation simpler, but these add
up.
Jens Axboe Jan. 18, 2024, 6:28 p.m. UTC | #2
On 1/18/24 11:25 AM, Keith Busch wrote:
> On Thu, Jan 18, 2024 at 11:04:57AM -0700, Jens Axboe wrote:
>> +#define DD_CPU_BUCKETS		32
>> +#define DD_CPU_BUCKETS_MASK	(DD_CPU_BUCKETS - 1)
> 
> A bit of wasted space on machines with < 32 CPUs, no? Sure, it's not
> much and the fixed size makes the implementation simpler, but these add
> up.

True, we could make it allocated instead; 32 was just pulled out of thin
air. It might make sense to make the magic value 8 or something instead,
as that's probably Good Enough in general. Or we could make it dependent
on the number of online CPUs when initialized. Honestly, I just didn't
want to overcomplicate this initial RFC.
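
To make the online-CPU-count option concrete, here is a minimal sketch of a
dynamically sized bucket array. It assumes bucket_lists becomes a pointer in
struct deadline_data; the nr_buckets field, the dd_alloc_buckets() helper,
and the DD_CPU_BUCKETS_MAX cap are illustrative, not part of the posted patch.

/*
 * Hypothetical variant: size the bucket array from the online CPU count,
 * rounded up to a power of two so that (nr_buckets - 1) still works as a
 * mask in the insertion path.
 */
#define DD_CPU_BUCKETS_MAX	32

static int dd_alloc_buckets(struct deadline_data *dd)
{
	unsigned int nr = min_t(unsigned int,
				roundup_pow_of_two(num_online_cpus()),
				DD_CPU_BUCKETS_MAX);
	unsigned int i;

	dd->bucket_lists = kcalloc(nr, sizeof(*dd->bucket_lists), GFP_KERNEL);
	if (!dd->bucket_lists)
		return -ENOMEM;

	dd->nr_buckets = nr;
	for (i = 0; i < nr; i++) {
		INIT_LIST_HEAD(&dd->bucket_lists[i].list);
		spin_lock_init(&dd->bucket_lists[i].lock);
	}
	return 0;
}

The bucket selection in dd_insert_to_bucket() would then become
get_cpu() & (dd->nr_buckets - 1), and dd_exit_sched() would need a matching
kfree().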
Bart Van Assche Jan. 18, 2024, 6:31 p.m. UTC | #3
On 1/18/24 10:04, Jens Axboe wrote:
> If we attempt to insert a list of requests but someone else is already
> running an insertion, then fallback to queueing it internally and let
> the existing inserter finish the operation.

Because this patch adds significant complexity: what are the use cases
that benefit from this kind of optimization? Are these perhaps workloads
on systems with many CPU cores and fast storage? If the storage is fast,
why use mq-deadline instead of "none" as the I/O scheduler?

Thanks,

Bart.
Jens Axboe Jan. 18, 2024, 6:33 p.m. UTC | #4
On 1/18/24 11:31 AM, Bart Van Assche wrote:
> On 1/18/24 10:04, Jens Axboe wrote:
>> If we attempt to insert a list of requests but someone else is already
>> running an insertion, then fallback to queueing it internally and let
>> the existing inserter finish the operation.
> 
> Because this patch adds significant complexity: what are the use cases
> that benefit from this kind of optimization? Are these perhaps workloads
> on systems with many CPU cores and fast storage? If the storage is fast,
> why to use mq-deadline instead of "none" as I/O-scheduler?

You and others complain that mq-deadline is slow and doesn't scale;
these two patches help improve that situation. Not sure why this is even
a question?
Bart Van Assche Jan. 18, 2024, 6:53 p.m. UTC | #5
On 1/18/24 10:33, Jens Axboe wrote:
> On 1/18/24 11:31 AM, Bart Van Assche wrote:
>> On 1/18/24 10:04, Jens Axboe wrote:
>>> If we attempt to insert a list of requests but someone else is already
>>> running an insertion, then fallback to queueing it internally and let
>>> the existing inserter finish the operation.
>>
>> Because this patch adds significant complexity: what are the use cases
>> that benefit from this kind of optimization? Are these perhaps workloads
>> on systems with many CPU cores and fast storage? If the storage is fast,
>> why to use mq-deadline instead of "none" as I/O-scheduler?
> 
> You and others complain that mq-deadline is slow and doesn't scale,
> these two patches help improve that situation. Not sure why this is even
> a question?

How much does this patch improve performance?

Thanks,

Bart.
Jens Axboe Jan. 18, 2024, 6:56 p.m. UTC | #6
On 1/18/24 11:53 AM, Bart Van Assche wrote:
> On 1/18/24 10:33, Jens Axboe wrote:
>> On 1/18/24 11:31 AM, Bart Van Assche wrote:
>>> On 1/18/24 10:04, Jens Axboe wrote:
>>>> If we attempt to insert a list of requests but someone else is already
>>>> running an insertion, then fallback to queueing it internally and let
>>>> the existing inserter finish the operation.
>>>
>>> Because this patch adds significant complexity: what are the use cases
>>> that benefit from this kind of optimization? Are these perhaps workloads
>>> on systems with many CPU cores and fast storage? If the storage is fast,
>>> why to use mq-deadline instead of "none" as I/O-scheduler?
>>
>> You and others complain that mq-deadline is slow and doesn't scale,
>> these two patches help improve that situation. Not sure why this is even
>> a question?
> 
> How much does this patch improve performance?

Do you need me to link the cover letter that you were CC'ed on?
Bart Van Assche Jan. 18, 2024, 8:46 p.m. UTC | #7
On 1/18/24 10:56, Jens Axboe wrote:
> Do you need me to link the cover letter that you were CC'ed on?

There is no reason to use such an aggressive tone in your emails.

In the cover letter I see performance numbers for the patch series in
its entirety but not for the individual patches. I'd like to know by
how much this patch by itself improves performance because whether or
not I will review this patch will depend on that data.

Thanks,

Bart.
Jens Axboe Jan. 18, 2024, 8:52 p.m. UTC | #8
On 1/18/24 1:46 PM, Bart Van Assche wrote:
> On 1/18/24 10:56, Jens Axboe wrote:
>> Do you need me to link the cover letter that you were CC'ed on?
> 
> There is no reason to use such an aggressive tone in your emails.

I'm getting frustrated with you because I need to say the same things
multiple times, and it doesn't seem like it's getting received at the
other end. And you had clearly seen the results, which means that rather
than the passive-aggressive question, you could have said

"It'd probably be useful to include some performance numbers in
 the commit messages themselves as well."

which would be received way differently than asking a question that you
already have the answer to.

> In the cover letter I see performance numbers for the patch series in
> its entirety but not for the individual patches. I'd like to know by
> how much this patch by itself improves performance because whether or
> not I will review this patch will depend on that data.

You can't really split them up, as you need both to see the real
benefit. The only reason they are split is because it makes sense to do
so in terms of the code, as they are two different paths and points of
contention. This obviously makes patch 2 look better than it is, and
that would be the case even if I swapped the order of them as well. As
mentioned in a previous reply, I'll be editing the commit messages and
probably just include the various performance results in patch 2 and
reference patch 1 from that too.
Bart Van Assche Jan. 19, 2024, 11:11 p.m. UTC | #9
On 1/18/24 12:52, Jens Axboe wrote:
> And you had clearly seen the results, which means that rather
> than the passive aggressive question, you could have said [ ... ]

I'm never passive-aggressive. If that is how my message was received, I
want to apologize.

Bart.

Patch

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 9e0ab3ea728a..eeeaaff189e1 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -81,8 +81,17 @@  struct dd_per_prio {
 
 enum {
 	DD_DISPATCHING	= 0,
+	DD_INSERTING	= 1,
 };
 
+#define DD_CPU_BUCKETS		32
+#define DD_CPU_BUCKETS_MASK	(DD_CPU_BUCKETS - 1)
+
+struct dd_bucket_list {
+	struct list_head list;
+	spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
 struct deadline_data {
 	/*
 	 * run time data
@@ -94,6 +103,9 @@  struct deadline_data {
 
 	unsigned long run_state;
 
+	atomic_t insert_seq;
+	struct dd_bucket_list bucket_lists[DD_CPU_BUCKETS];
+
 	struct dd_per_prio per_prio[DD_PRIO_COUNT];
 
 	/* Data direction of latest dispatched request. */
@@ -711,7 +723,7 @@  static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 	struct deadline_data *dd;
 	struct elevator_queue *eq;
 	enum dd_prio prio;
-	int ret = -ENOMEM;
+	int i, ret = -ENOMEM;
 
 	eq = elevator_alloc(q, e);
 	if (!eq)
@@ -725,6 +737,12 @@  static int dd_init_sched(struct request_queue *q, struct elevator_type *e)
 
 	spin_lock_init(&dd->lock);
 	spin_lock_init(&dd->zone_lock);
+	atomic_set(&dd->insert_seq, 0);
+
+	for (i = 0; i < DD_CPU_BUCKETS; i++) {
+		INIT_LIST_HEAD(&dd->bucket_lists[i].list);
+		spin_lock_init(&dd->bucket_lists[i].lock);
+	}
 
 	for (prio = 0; prio <= DD_PRIO_MAX; prio++) {
 		struct dd_per_prio *per_prio = &dd->per_prio[prio];
@@ -876,6 +894,67 @@  static void dd_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	}
 }
 
+static void dd_dispatch_from_buckets(struct deadline_data *dd,
+				     struct list_head *list)
+{
+	int i;
+
+	for (i = 0; i < DD_CPU_BUCKETS; i++) {
+		struct dd_bucket_list *bucket = &dd->bucket_lists[i];
+
+		if (list_empty_careful(&bucket->list))
+			continue;
+		spin_lock(&bucket->lock);
+		list_splice_init(&bucket->list, list);
+		spin_unlock(&bucket->lock);
+	}
+}
+
+/*
+ * If we can grab the dd->lock, then just return and do the insertion as per
+ * usual. If not, add to one of our internal buckets, and afterwards recheck
+ * if we should retry.
+ */
+static bool dd_insert_to_bucket(struct deadline_data *dd,
+				struct list_head *list, int *seq)
+	__acquires(&dd->lock)
+{
+	struct dd_bucket_list *bucket;
+	int next_seq;
+
+	*seq = atomic_read(&dd->insert_seq);
+
+	if (spin_trylock(&dd->lock))
+		return false;
+	if (!test_bit(DD_INSERTING, &dd->run_state)) {
+		spin_lock(&dd->lock);
+		return false;
+	}
+
+	*seq = atomic_inc_return(&dd->insert_seq);
+
+	bucket = &dd->bucket_lists[get_cpu() & DD_CPU_BUCKETS_MASK];
+	spin_lock(&bucket->lock);
+	list_splice_init(list, &bucket->list);
+	spin_unlock(&bucket->lock);
+	put_cpu();
+
+	/*
+	 * If seq still matches, we should be safe to just exit with the
+	 * pending requests queued in a bucket.
+	 */
+	next_seq = atomic_inc_return(&dd->insert_seq);
+	if (next_seq == *seq + 1)
+		return true;
+
+	/*
+	 * Seq changed, be safe and grab the lock and insert. Don't update
+	 * sequence, so that we flush the buckets too.
+	 */
+	spin_lock(&dd->lock);
+	return false;
+}
+
 /*
  * Called from blk_mq_insert_request() or blk_mq_dispatch_plug_list().
  */
@@ -886,16 +965,39 @@  static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 	struct request_queue *q = hctx->queue;
 	struct deadline_data *dd = q->elevator->elevator_data;
 	LIST_HEAD(free);
+	int seq;
 
-	spin_lock(&dd->lock);
-	while (!list_empty(list)) {
-		struct request *rq;
+	/*
+	 * If another task is busy inserting and we ended up adding to one of
+	 * our internal buckets, then we're done for now.
+	 */
+	if (dd_insert_to_bucket(dd, list, &seq))
+		return;
+
+	set_bit(DD_INSERTING, &dd->run_state);
+	do {
+		int next_seq;
+
+		while (!list_empty(list)) {
+			struct request *rq;
+
+			rq = list_first_entry(list, struct request, queuelist);
+			list_del_init(&rq->queuelist);
+			dd_insert_request(hctx, rq, flags, &free);
+		}
+
+		/*
+		 * If sequence changed, flush internal buckets
+		 */
+		next_seq = atomic_inc_return(&dd->insert_seq);
+		if (next_seq == seq + 1)
+			break;
+		seq = next_seq;
+		dd_dispatch_from_buckets(dd, list);
+	} while (1);
 
-		rq = list_first_entry(list, struct request, queuelist);
-		list_del_init(&rq->queuelist);
-		dd_insert_request(hctx, rq, flags, &free);
-	}
 	spin_unlock(&dd->lock);
+	clear_bit(DD_INSERTING, &dd->run_state);
 
 	blk_mq_free_requests(&free);
 }
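
Stepping back from the diff itself: the idea is to try dd->lock, and if
another task is already in the insert loop, park the requests on a per-CPU
bucket and rely on the current lock holder to drain them before it lets go of
the lock. Below is a self-contained userspace sketch of that handoff pattern,
written with pthreads, a single side list instead of per-CPU buckets, and a
second lock around the holder's exit check instead of the patch's insert_seq
counter; all of the names are illustrative and none of this is from the
kernel tree.

/*
 * Simplified userspace analogue of the "defer to the current lock holder"
 * pattern: a submitter that finds main_lock contended parks its work on a
 * side list, and whoever holds main_lock must drain that list before it is
 * allowed to release the lock.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct work {
	int id;
	struct work *next;
};

static pthread_mutex_t main_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t side_lock = PTHREAD_MUTEX_INITIALIZER;
static struct work *side_list;	/* work parked by contended submitters */

static void process(struct work *w)
{
	while (w) {
		struct work *next = w->next;

		printf("processed %d\n", w->id);
		free(w);
		w = next;
	}
}

static void submit(struct work *w)
{
	if (pthread_mutex_trylock(&main_lock) != 0) {
		/*
		 * Contended.  Re-check under side_lock: the holder can only
		 * release main_lock from inside a side_lock section, so if
		 * main_lock is still busy here, whatever we park now is
		 * guaranteed to be seen by the holder before it exits.
		 */
		pthread_mutex_lock(&side_lock);
		if (pthread_mutex_trylock(&main_lock) != 0) {
			w->next = side_list;
			side_list = w;
			pthread_mutex_unlock(&side_lock);
			return;
		}
		pthread_mutex_unlock(&side_lock);
	}

	/* We own main_lock: handle our own work, then any parked work. */
	for (;;) {
		process(w);

		pthread_mutex_lock(&side_lock);
		if (!side_list) {
			/* Drop main_lock while still holding side_lock, so no
			 * submitter can park work after this final check. */
			pthread_mutex_unlock(&main_lock);
			pthread_mutex_unlock(&side_lock);
			return;
		}
		w = side_list;
		side_list = NULL;
		pthread_mutex_unlock(&side_lock);
	}
}

static void *worker(void *arg)
{
	struct work *w = malloc(sizeof(*w));

	w->id = (int)(long)arg;
	w->next = NULL;
	submit(w);
	return NULL;
}

int main(void)
{
	pthread_t threads[4];
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&threads[i], NULL, worker, (void *)(long)i);
	for (i = 0; i < 4; i++)
		pthread_join(threads[i], NULL);
	return 0;
}

The patch aims for a similar no-stranded-work guarantee without serializing
the owner's exit on a second lock: each bucket insertion bumps the atomic
insert_seq counter, and the owner keeps re-draining the buckets until the
counter stops moving, which is roughly what the sequence checks in
dd_insert_to_bucket() and dd_insert_requests() are about.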