
[4/6] blk-mq: use EWMA to estimate congestion threshold

Message ID 20170711182103.11461-5-ming.lei@redhat.com (mailing list archive)

Commit Message

Ming Lei July 11, 2017, 6:21 p.m. UTC
When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
consider that there is congestion in either the low-level
driver or the hardware.

This patch uses an EWMA to estimate this congestion threshold,
which can then be used to detect/avoid congestion.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 14 ++++++++++++++
 block/blk-mq.h         |  9 +++++++++
 include/linux/blk-mq.h |  2 ++
 3 files changed, 25 insertions(+)

Comments

Jens Axboe July 11, 2017, 6:25 p.m. UTC | #1
On 07/11/2017 12:21 PM, Ming Lei wrote:
> When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> consider that there is congestion in either the low-level
> driver or the hardware.
> 
> This patch uses an EWMA to estimate this congestion threshold,
> which can then be used to detect/avoid congestion.

This whole patch set lacks some sort of reasoning why this
makes sense. I'm assuming you want to reduce unnecessary
restarts of the queue? I would much rather ensure that we only
start when we absolutely have to, to begin with; I'm pretty sure
we have a number of cases where that is not so.

What happens with fluid congestion boundaries, with shared tags?
Jens Axboe July 11, 2017, 6:39 p.m. UTC | #2
On 07/11/2017 12:21 PM, Ming Lei wrote:
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 60b01c0309bc..c4516d2a2d2c 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
>  	return hctx->nr_ctx && hctx->tags;
>  }
>  
> +/* borrowed from bcache */
> +#define ewma_add(ewma, val, weight, factor)                             \
> +({                                                                      \
> +        (ewma) *= (weight) - 1;                                         \
> +        (ewma) += (val) << factor;                                      \
> +        (ewma) /= (weight);                                             \
> +        (ewma) >> factor;                                               \
> +})

Just put that in blk_mq_update_req_dispatch_busy(), or at least make it
a static function in blk-mq.c above that function. You don't use factor
at all.
Bart Van Assche July 11, 2017, 9:02 p.m. UTC | #3
On Wed, 2017-07-12 at 02:21 +0800, Ming Lei wrote:
> When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> consider that there is congestion in either the low-level
> driver or the hardware.
> 
> This patch uses an EWMA to estimate this congestion threshold,
> which can then be used to detect/avoid congestion.

Hello Ming,

Does EWMA stand for "exponentially weighted moving average" in the context of
this patch? If so, please mention this.

> +static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct sbitmap_queue *sbq;
> +	unsigned depth;
> +
> +	sbq = &hctx->tags->bitmap_tags;
> +	depth = sbitmap_weight(&sbq->sb);
> +
> +	/* use EWMA to estimate a threshold for detecting congestion */
> +	ewma_add(hctx->avg_busy_threshold, depth, 8, 0);
> +}

This function has been named after the context it is called from. Wouldn't it
be clearer to change the name of this function to something that refers to
what this function does, e.g. blk_mq_update_avg_busy_threshold()?

Additionally, I think that the behavior of e.g. the SCSI and dm-mpath drivers
is too complicated for this approach to be effective. If you want to proceed
with this approach I think it should be possible for block drivers to opt out
of the mechanism introduced in the next patch.
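
(One hypothetical shape for such an opt-out -- the flag name is
invented here for illustration, not part of the posted patches:)

	/* driver opts out of EWMA-based congestion detection */
	set->flags |= BLK_MQ_F_NO_BUSY_THRESHOLD;

	/* and early in blk_mq_update_req_dispatch_busy(): */
	if (hctx->flags & BLK_MQ_F_NO_BUSY_THRESHOLD)
		return;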

> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 60b01c0309bc..c4516d2a2d2c 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
>  	return hctx->nr_ctx && hctx->tags;
>  }
>  
> +/* borrowed from bcache */
> +#define ewma_add(ewma, val, weight, factor)                             \
> +({                                                                      \
> +        (ewma) *= (weight) - 1;                                         \
> +        (ewma) += (val) << factor;                                      \
> +        (ewma) /= (weight);                                             \
> +        (ewma) >> factor;                                               \
> +})

Sorry but this does not match how others define an exponentially weighted moving
average. As far as I know the ewma values should be updated as follows:

   new_ewma = w * val + (1 - w) * current_ewma

where 0 < w <= 1 is a rational number (typically 0.05 <= w <= 0.3). See also
https://en.wikipedia.org/wiki/EWMA_chart.

Bart.
Ming Lei July 12, 2017, 2:30 a.m. UTC | #4
On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> On 07/11/2017 12:21 PM, Ming Lei wrote:
> > When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> > consider that there is congestion in either the low-level
> > driver or the hardware.
> > 
> > This patch uses an EWMA to estimate this congestion threshold,
> > which can then be used to detect/avoid congestion.
> 
> This whole patch set lacks some sort of reasoning why this
> makes sense. I'm assuming you want to reduce unnecessary
> restarts of the queue?

When a low-level driver returns BLK_STS_RESOURCE, it means
either the driver or the hardware can't handle the incoming requests.
The issue is actually a congestion control problem, IMO.

There are a few approaches taken in drivers to deal with the issue:

	1) virtio-blk/xen-blkfront both just stop the queue in this
	case, then restart the queue when a request is completed.

	2) virtio-scsi chooses not to stop the queue

For reaching good IO performance, we often do the following two
things:
	- try to keep the queue pipeline as full as possible by
	increasing the queue depth
	- run several tasks doing IO at the same time

So solution 1) above can't be efficient: when one request completes
and the queue is restarted, there are lots of requests in the queue,
and dispatching these requests will cause lots of BLK_STS_RESOURCE
returns too. The dispatch isn't cheap at all; we need to acquire the
sw queue lock/scheduler lock/hw queue lock/low-level queue lock along
the whole path, so we should try to avoid dispatching to a busy queue.

Solution 2) simply doesn't stop the queue; the unnecessary
dispatching to a busy queue just wastes CPU, and it has the same issue
as 1) too. As you can see in the commit log of patch 5, a big
improvement is observed with this simple implementation:

	sequential read test(libaio, bs:4k, direct io, queue depth:64, 8 jobs)
	on virtio-scsi shows that:
	        - CPU utilization decreases ~20%
	        - IOPS increases by ~10%
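
(For reference, a fio invocation that roughly reproduces the workload
quoted above; the device path /dev/vdb is hypothetical, adjust it to
the virtio-scsi disk under test:)

	fio --name=seqread --filename=/dev/vdb --rw=read --bs=4k \
	    --ioengine=libaio --iodepth=64 --direct=1 --numjobs=8 \
	    --runtime=60 --time_based --group_reporting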

EWMA is used to estimate an average queue depth at which
the queue will often be busy. In this patchset, the average queue
depth when the queue becomes busy is called the congestion threshold.

I chose EWMA because it is a usual/common way to figure out a
weighted average value, and it has been used in several areas of the
kernel (wifi rate control, sequential I/O estimation in bcache,
networking, ...). Or is there a better way to do that? I am happy
to take it; in the future we may even support more than one approach
to address the issue.

Another motivation is to unexport the start/stop queue APIs.

> I would much rather ensure that we only
> start when we absolutely have to to begin with, I'm pretty sure
> we have a number of cases where that is not so.

Could you share those cases with us? I believe we can integrate different
approaches for congestion control to address different requirements.

> 
> What happens with fluid congestion boundaries, with shared tags?

The approach in this patch should work, but the threshold may not
be accurate in that case; one simple method is to use the average
tag weight in the EWMA, like this:

	sbitmap_weight() / hctx->tags->active_queues
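
(A rough, untested sketch of what that could look like; the function
and field names follow this patch, but this shared-tags variant is
hypothetical:)

	static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
	{
		struct sbitmap_queue *sbq = &hctx->tags->bitmap_tags;
		unsigned int depth = sbitmap_weight(&sbq->sb);
		unsigned int users = atomic_read(&hctx->tags->active_queues);

		/* with shared tags, feed the per-queue share into the EWMA */
		if (users > 1)
			depth /= users;

		ewma_add(hctx->avg_busy_threshold, depth, 8, 0);
	}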
Ming Lei July 12, 2017, 3:20 a.m. UTC | #5
On Tue, Jul 11, 2017 at 12:39:35PM -0600, Jens Axboe wrote:
> On 07/11/2017 12:21 PM, Ming Lei wrote:
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 60b01c0309bc..c4516d2a2d2c 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
> >  	return hctx->nr_ctx && hctx->tags;
> >  }
> >  
> > +/* borrowed from bcache */
> > +#define ewma_add(ewma, val, weight, factor)                             \
> > +({                                                                      \
> > +        (ewma) *= (weight) - 1;                                         \
> > +        (ewma) += (val) << factor;                                      \
> > +        (ewma) /= (weight);                                             \
> > +        (ewma) >> factor;                                               \
> > +})
> 
> Just put that in blk_mq_update_req_dispatch_busy(), or at least make it
> a static function in blk-mq.c above that function. You don't use factor
> at all.

OK, will do it in V2.
Ming Lei July 12, 2017, 3:43 a.m. UTC | #6
On Tue, Jul 11, 2017 at 09:02:13PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 02:21 +0800, Ming Lei wrote:
> > When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> > consider that there is congestion in either low level
> > driver or hardware.
> > 
> > This patch uses EWMA to estimate this congestion threshold,
> > then this threshold can be used to detect/avoid congestion.
> 
> Hello Ming,
> 
> Does EWMA stand for "exponentially weighted moving average" in the context of
> this patch? If so, please mention this.

Yes and OK.

> 
> > +static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	struct sbitmap_queue *sbq;
> > +	unsigned depth;
> > +
> > +	sbq = &hctx->tags->bitmap_tags;
> > +	depth = sbitmap_weight(&sbq->sb);
> > +
> > +	/* use EWMA to estimate a threshold for detecting congestion */
> > +	ewma_add(hctx->avg_busy_threshold, depth, 8, 0);
> > +}
> 
> This function has been named after the context it is called from. Wouldn't it
> be clearer to change the name of this function to something that refers to
> what this function does, e.g. blk_mq_update_avg_busy_threshold()?

In the next patch, more things will be done in this function.

> 
> Additionally, I think that the behavior of e.g. the SCSI and dm-mpath drivers
> is too complicated for this approach to be effective. If you want to proceed
> with this approach I think it should be possible for block drivers to opt out
> of the mechanism introduced in the next patch.

dm might be a bit special, but for SCSI I suggest using this, since I see
an obvious improvement with virtio-scsi.

But it depends on performance: if there isn't any perf loss, I'd rather
do it for all drivers (including dm); we can even develop other, smarter
approaches for special requirements if there are any.

> 
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 60b01c0309bc..c4516d2a2d2c 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
> >  	return hctx->nr_ctx && hctx->tags;
> >  }
> >  
> > +/* borrowed from bcache */
> > +#define ewma_add(ewma, val, weight, factor)                             \
> > +({                                                                      \
> > +        (ewma) *= (weight) - 1;                                         \
> > +        (ewma) += (val) << factor;                                      \
> > +        (ewma) /= (weight);                                             \
> > +        (ewma) >> factor;                                               \
> > +})
> 
> Sorry but this does not match how others define an exponentially weighted moving
> average. As far as I know the ewma values should be updated as follows:
> 
>    new_ewma = w * val + (1 - w) * current_ewma
> 
> where 0 < w <= 1 is a rational number (typically 0.05 <= w <= 0.3). See also
> https://en.wikipedia.org/wiki/EWMA_chart.

Yes. For the way in this patch, w is 1/8 and factor is zero; this form just
lets the computer do it efficiently with integer arithmetic. There is no big
difference from the formal definition, and as you can see, ewma_add() is
borrowed from bcache.
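
(Expanded for clarity: with weight = 8 and factor = 0, ewma_add()
computes

	new_ewma = (old_ewma * 7 + val) / 8
	         = (7/8) * old_ewma + (1/8) * val

which is the textbook update with w = 1/8, up to integer truncation;
e.g. old_ewma = 32, val = 64 gives (32 * 7 + 64) / 8 = 36.)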
Bart Van Assche July 12, 2017, 3:39 p.m. UTC | #7
On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > What happens with fluid congestion boundaries, with shared tags?
> 
> The approach in this patch should work, but the threshold may not
> be accurate in that case; one simple method is to use the average
> tag weight in the EWMA, like this:
> 
> 	sbitmap_weight() / hctx->tags->active_queues

Hello Ming,

That approach would result in a severe performance degradation: "active_queues"
represents the number of queues against which I/O has ever been queued.
If e.g. 64 LUNs are associated with a single SCSI host, all 64 LUNs are
responding, and the queue depth is also 64, then the approach you
proposed would reduce the effective queue depth per LUN from 64 to 1.

Bart.
Ming Lei July 13, 2017, 10:43 a.m. UTC | #8
On Wed, Jul 12, 2017 at 03:39:14PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > > What happens with fluid congestion boundaries, with shared tags?
> > 
> > The approach in this patch should work, but the threshold may not
> > be accurate in that case; one simple method is to use the average
> > tag weight in the EWMA, like this:
> > 
> > 	sbitmap_weight() / hctx->tags->active_queues
> 
> Hello Ming,
> 
> That approach would result in a severe performance degradation: "active_queues"
> represents the number of queues against which I/O has ever been queued.
> If e.g. 64 LUNs are associated with a single SCSI host, all 64 LUNs are
> responding, and the queue depth is also 64, then the approach you
> proposed would reduce the effective queue depth per LUN from 64 to 1.

No, this approach does _not_ reduce the effective queue depth; it only
stops the queue for a while when the queue is busy enough.

In this case, there may not be congestion, because blk-mq allows at most
queue_depth/active_queues tags to be assigned to each LUN, please see
hctx_may_queue(). Then get_driver_tag() can return at most one pending
tag to the request_queue (LUN).

The algorithm in this patch only starts to work when congestion happens;
that is, it is only run when BLK_STS_RESOURCE is returned from .queue_rq().
This approach avoids dispatching requests to a busy queue
unnecessarily, so that we don't burn CPU needlessly, and
merging gets improved in the meantime.
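
(A simplified sketch of the fair-share check in hctx_may_queue(),
paraphrased from blk-mq-tag.c of that era; the exact guards and
constants in the real function may differ:)

	static bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
				   struct sbitmap_queue *bt)
	{
		unsigned int depth, users;

		/* no fair-share limit unless the tag set is shared */
		if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
			return true;

		users = atomic_read(&hctx->tags->active_queues);
		if (!users)
			return true;

		/* give each active queue roughly an equal share of the tags */
		depth = max((bt->sb.depth + users - 1) / users, 4U);
		return atomic_read(&hctx->nr_active) < depth;
	}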
Bart Van Assche July 13, 2017, 2:56 p.m. UTC | #9
On Thu, 2017-07-13 at 18:43 +0800, Ming Lei wrote:
> On Wed, Jul 12, 2017 at 03:39:14PM +0000, Bart Van Assche wrote:
> > On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> > > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > > > What happens with fluid congestion boundaries, with shared tags?
> > > 
> > > The approach in this patch should work, but the threshold may not
> > > be accurate in that case; one simple method is to use the average
> > > tag weight in the EWMA, like this:
> > > 
> > > 	sbitmap_weight() / hctx->tags->active_queues
> > 
> > Hello Ming,
> > 
> > That approach would result in a severe performance degradation: "active_queues"
> > represents the number of queues against which I/O has ever been queued.
> > If e.g. 64 LUNs are associated with a single SCSI host, all 64 LUNs are
> > responding, and the queue depth is also 64, then the approach you
> > proposed would reduce the effective queue depth per LUN from 64 to 1.
> 
> No, this approach does _not_ reduce the effective queue depth; it only
> stops the queue for a while when the queue is busy enough.
>
> In this case, there may not be congestion, because blk-mq allows at most
> queue_depth/active_queues tags to be assigned to each LUN, please see hctx_may_queue().

Hello Ming,

hctx_may_queue() severely limits the queue depth if many LUNs are associated
with the same SCSI host. I think that this is a performance regression
compared to scsi-sq and that this performance regression should be fixed.

Bart.
Ming Lei July 13, 2017, 3:32 p.m. UTC | #10
On Thu, Jul 13, 2017 at 02:56:38PM +0000, Bart Van Assche wrote:
> On Thu, 2017-07-13 at 18:43 +0800, Ming Lei wrote:
> > On Wed, Jul 12, 2017 at 03:39:14PM +0000, Bart Van Assche wrote:
> > > On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> > > > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > > > > What happens with fluid congestion boundaries, with shared tags?
> > > > 
> > > > The approach in this patch should work, but the threshold may not
> > > > be accurate in that case; one simple method is to use the average
> > > > tag weight in the EWMA, like this:
> > > > 
> > > > 	sbitmap_weight() / hctx->tags->active_queues
> > > 
> > > Hello Ming,
> > > 
> > > That approach would result in a severe performance degradation: "active_queues"
> > > represents the number of queues against which I/O has ever been queued.
> > > If e.g. 64 LUNs are associated with a single SCSI host, all 64 LUNs are
> > > responding, and the queue depth is also 64, then the approach you
> > > proposed would reduce the effective queue depth per LUN from 64 to 1.
> > 
> > No, this approach does _not_ reduce the effective queue depth; it only
> > stops the queue for a while when the queue is busy enough.
> >
> > In this case, there may not be congestion, because blk-mq allows at most
> > queue_depth/active_queues tags to be assigned to each LUN, please see hctx_may_queue().
> 
> Hello Ming,
> 
> hctx_may_queue() severely limits the queue depth if many LUNs are associated
> with the same SCSI host. I think that this is a performance regression
> compared to scsi-sq and that this performance regression should be fixed.

IMO, it is hard to evaluate/compare perf between scsi-mq and scsi-sq:

	- how many LUNs do you run IO on concurrently?
	- is the perf evaluated on a single LUN or on multiple LUNs?

BTW, active_queues is a runtime variable which counts the actual active
queues in use.
Bart Van Assche July 13, 2017, 5:35 p.m. UTC | #11
On Thu, 2017-07-13 at 23:32 +0800, Ming Lei wrote:
> On Thu, Jul 13, 2017 at 02:56:38PM +0000, Bart Van Assche wrote:
> > hctx_may_queue() severely limits the queue depth if many LUNs are associated
> > with the same SCSI host. I think that this is a performance regression
> > compared to scsi-sq and that this performance regression should be fixed.
> 
> IMO, it is hard to evaluate/compare perf between scsi-mq and scsi-sq:
> 
> 	- how many LUNs do you run IO on concurrently?
> 	- is the perf evaluated on a single LUN or on multiple LUNs?
> 
> BTW, active_queues is a runtime variable which counts the actual active
> queues in use.

Hello Ming,

What I described can be verified easily by running fio against a single LUN
of a SCSI host with which a large number of LUNs are associated.

BTW, something I overlooked in my analysis of the active_queues variable is
that it is not only decremented when a queue is destroyed but also when a queue
has been idle for (block layer request timeout) seconds. So the problem I
described will only occur if some software submits I/O requests periodically
to all SCSI LUNs with a shorter interval than the I/O timeout. I'm not sure
any Linux software does this today. As an example, the time between path
checks by the multipathd TUR checker typically exceeds the timeout of a
single TUR check.

Bart.

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6e0fc80aa151..da50c187c508 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -976,6 +976,18 @@  static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
+static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
+{
+	struct sbitmap_queue *sbq;
+	unsigned depth;
+
+	sbq = &hctx->tags->bitmap_tags;
+	depth = sbitmap_weight(&sbq->sb);
+
+	/* use EWMA to estimate a threshold for detecting congestion */
+	ewma_add(hctx->avg_busy_threshold, depth, 8, 0);
+}
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 {
 	struct blk_mq_hw_ctx *hctx;
@@ -1064,6 +1076,7 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
+		blk_mq_update_req_dispatch_busy(hctx);
 		spin_unlock(&hctx->lock);
 
 		/*
@@ -1468,6 +1481,7 @@  static void blk_mq_direct_dispatch(struct blk_mq_hw_ctx *hctx,
 {
 	spin_lock(&hctx->lock);
 	list_add(&rq->queuelist, &hctx->dispatch);
+	blk_mq_update_req_dispatch_busy(hctx);
 	spin_unlock(&hctx->lock);
 
 	blk_mq_run_hw_queue(hctx, false);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 60b01c0309bc..c4516d2a2d2c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -133,4 +133,13 @@  static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 	return hctx->nr_ctx && hctx->tags;
 }
 
+/* borrowed from bcache */
+#define ewma_add(ewma, val, weight, factor)                             \
+({                                                                      \
+        (ewma) *= (weight) - 1;                                         \
+        (ewma) += (val) << factor;                                      \
+        (ewma) /= (weight);                                             \
+        (ewma) >> factor;                                               \
+})
+
 #endif
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 14542308d25b..8694fb39cd80 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -22,6 +22,8 @@  struct blk_mq_hw_ctx {
 
 	unsigned long		flags;		/* BLK_MQ_F_* flags */
 
+	unsigned long		avg_busy_threshold;
+
 	void			*sched_data;
 	struct request_queue	*queue;
 	struct blk_flush_queue	*fq;