| Message ID | 20170711182103.11461-5-ming.lei@redhat.com (mailing list archive) |
| --- | --- |
| State | New, archived |
On 07/11/2017 12:21 PM, Ming Lei wrote:
> When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> consider that there is congestion in either low level
> driver or hardware.
>
> This patch uses EWMA to estimate this congestion threshold,
> then this threshold can be used to detect/avoid congestion.

This whole patch set lacks some sort of reasoning why this makes sense. I'm assuming you want to reduce unnecessary restarts of the queue? I would much rather ensure that we only start when we absolutely have to, to begin with; I'm pretty sure we have a number of cases where that is not so.

What happens with fluid congestion boundaries, with shared tags?
On 07/11/2017 12:21 PM, Ming Lei wrote:
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 60b01c0309bc..c4516d2a2d2c 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
>  	return hctx->nr_ctx && hctx->tags;
>  }
>  
> +/* borrowed from bcache */
> +#define ewma_add(ewma, val, weight, factor)	\
> +({						\
> +	(ewma) *= (weight) - 1;			\
> +	(ewma) += (val) << factor;		\
> +	(ewma) /= (weight);			\
> +	(ewma) >> factor;			\
> +})

Just put that in blk_mq_update_req_dispatch_busy(), or at least make it a static function in blk-mq.c above that function. You don't use factor at all.
On Wed, 2017-07-12 at 02:21 +0800, Ming Lei wrote:
> When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> consider that there is congestion in either low level
> driver or hardware.
>
> This patch uses EWMA to estimate this congestion threshold,
> then this threshold can be used to detect/avoid congestion.

Hello Ming,

Does EWMA stand for "exponentially weighted moving average" in the context of this patch? If so, please mention this.

> +static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct sbitmap_queue *sbq;
> +	unsigned depth;
> +
> +	sbq = &hctx->tags->bitmap_tags;
> +	depth = sbitmap_weight(&sbq->sb);
> +
> +	/* use EWMA to estimate a threshold for detecting congestion */
> +	ewma_add(hctx->avg_busy_threshold, depth, 8, 0);
> +}

This function has been named after the context it is called from. Wouldn't it be clearer to give it a name that refers to what the function does, e.g. blk_mq_update_avg_busy_threshold()?

Additionally, I think that the behavior of e.g. the SCSI and dm-mpath drivers is too complicated for this approach to be effective. If you want to proceed with this approach, I think it should be possible for block drivers to opt out of the mechanism introduced in the next patch.

> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 60b01c0309bc..c4516d2a2d2c 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
>  	return hctx->nr_ctx && hctx->tags;
>  }
>  
> +/* borrowed from bcache */
> +#define ewma_add(ewma, val, weight, factor)	\
> +({						\
> +	(ewma) *= (weight) - 1;			\
> +	(ewma) += (val) << factor;		\
> +	(ewma) /= (weight);			\
> +	(ewma) >> factor;			\
> +})

Sorry, but this does not match how others define an exponentially weighted moving average. As far as I know, the EWMA values should be updated as follows:

    new_ewma = w * val + (1 - w) * current_ewma

where 0 < w <= 1 is a rational number (typically 0.05 <= w <= 0.3). See also https://en.wikipedia.org/wiki/EWMA_chart.

Bart.
On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> On 07/11/2017 12:21 PM, Ming Lei wrote:
> > When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> > consider that there is congestion in either low level
> > driver or hardware.
> >
> > This patch uses EWMA to estimate this congestion threshold,
> > then this threshold can be used to detect/avoid congestion.
>
> This whole patch set lacks some sort of reasoning why this
> makes sense. I'm assuming you want to reduce unnecessary
> restarts of the queue?

When a low level driver returns BLK_STS_RESOURCE, it means either the driver or the hardware can't handle the incoming requests. The issue is actually a congestion control problem, IMO.

Drivers currently deal with the issue in a couple of ways:

1) both virtio-blk and xen-blkfront just stop the queue in this case, then restart the queue when one request is completed.

2) virtio-scsi chooses not to stop the queue.

To reach good IO performance, we often do the following two things:

- try to keep the queue pipeline as full as possible by increasing queue depth
- run several tasks doing IO at the same time

So solution 1) above can't be efficient: when one request completes and the queue is restarted, there are lots of requests in the queue, and dispatching them will cause lots of BLK_STS_RESOURCE returns too. Dispatch isn't cheap at all; we need to acquire the sw queue lock, scheduler lock, hw queue lock and low level queue lock along the whole path, so we should try to avoid dispatching to a busy queue.

Solution 2) simply doesn't stop the queue; unnecessary dispatching to a busy queue just wastes CPU, and it has the same issue as 1) too.

As you can see in the commit log of patch 5, a big improvement is observed with this simple implementation. A sequential read test (libaio, bs: 4k, direct io, queue depth: 64, 8 jobs) on virtio-scsi shows that:

- CPU utilization decreases ~20%
- IOPS increases by ~10%

EWMA is used to estimate an average queue depth at which the queue will often be busy. In this patchset, the average queue depth when the queue becomes busy is called the congestion threshold.

I chose EWMA because it is a usual/common way to compute a weighted average, and it has been used in several areas of the kernel (wifi rate control, sequential I/O estimation in bcache, networking, ...). Or is there a better way to do it? I am happy to take it; in the future we may even support more than one approach to address the issue.

Another motivation is to unexport the start/stop queue APIs.

> I would much rather ensure that we only
> start when we absolutely have to to begin with, I'm pretty sure
> we have a number of cases where that is not so.

Could you share these cases with us? I believe we can integrate different congestion control approaches to address different requirements.

> What happens with fluid congestion boundaries, with shared tags?

The approach in this patch should work, but the threshold may not be accurate in that case. One simple method is to use the average tag weight in the EWMA, like this:

	sbitmap_weight() / hctx->tags->active_queues
On Tue, Jul 11, 2017 at 12:39:35PM -0600, Jens Axboe wrote:
> On 07/11/2017 12:21 PM, Ming Lei wrote:
> > diff --git a/block/blk-mq.h b/block/blk-mq.h
> > index 60b01c0309bc..c4516d2a2d2c 100644
> > --- a/block/blk-mq.h
> > +++ b/block/blk-mq.h
> > @@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
> >  	return hctx->nr_ctx && hctx->tags;
> >  }
> >  
> > +/* borrowed from bcache */
> > +#define ewma_add(ewma, val, weight, factor)	\
> > +({						\
> > +	(ewma) *= (weight) - 1;			\
> > +	(ewma) += (val) << factor;		\
> > +	(ewma) /= (weight);			\
> > +	(ewma) >> factor;			\
> > +})
>
> Just put that in blk_mq_update_req_dispatch_busy(), or at least make it
> a static function in blk-mq.c above that function. You don't use factor
> at all.

OK, will do it in V2.
On Tue, Jul 11, 2017 at 09:02:13PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 02:21 +0800, Ming Lei wrote:
> > When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
> > consider that there is congestion in either low level
> > driver or hardware.
> >
> > This patch uses EWMA to estimate this congestion threshold,
> > then this threshold can be used to detect/avoid congestion.
>
> Hello Ming,
>
> Does EWMA stand for "exponentially weighted moving average" in the context of
> this patch? If so, please mention this.

Yes, and OK.

> > +static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
> > +{
> > +	struct sbitmap_queue *sbq;
> > +	unsigned depth;
> > +
> > +	sbq = &hctx->tags->bitmap_tags;
> > +	depth = sbitmap_weight(&sbq->sb);
> > +
> > +	/* use EWMA to estimate a threshold for detecting congestion */
> > +	ewma_add(hctx->avg_busy_threshold, depth, 8, 0);
> > +}
>
> This function has been named after the context it is called from. Wouldn't it
> be more clear to change the name of this function into something that refers to
> what this function does, e.g. blk_mq_update_avg_busy_threshold()?

In the next patch, more things will be done in this function.

> Additionally, I think that the behavior of e.g. the SCSI and dm-mpath drivers
> is too complicated for this approach to be effective. If you want to proceed
> with this approach I think it should be possible for block drivers to opt out
> of the mechanism introduced in the next patch.

dm might be a bit special, but for SCSI I suggest using this, since I see an obvious improvement on virtio-scsi. It depends on performance, though: if there isn't any perf loss, I'd rather do it for all (including dm), though we can develop other smart ways for special requirements if there are any.

> > +/* borrowed from bcache */
> > +#define ewma_add(ewma, val, weight, factor)	\
> > +({						\
> > +	(ewma) *= (weight) - 1;			\
> > +	(ewma) += (val) << factor;		\
> > +	(ewma) /= (weight);			\
> > +	(ewma) >> factor;			\
> > +})
>
> Sorry but this does not match how others define an exponentially weighted moving
> average. As far as I know the ewma values should be updated as follows:
>
>     new_ewma = w * val + (1 - w) * current_ewma
>
> where 0 < w <= 1 is a rational number (typically 0.05 <= w <= 0.3). See also
> https://en.wikipedia.org/wiki/EWMA_chart.

Yes. In the form used by this patch, w is 1/8 and factor is zero; the integer form is just so the computer can do it efficiently. There is no big difference from the definition in the literature, and as you can see, ewma_add() is borrowed from bcache.
On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > What happens with fluid congestion boundaries, with shared tags?
>
> The approach in this patch should work, but the threshold may not
> be accurate in this way, one simple method is to use the average
> tag weight in EWMA, like this:
>
> sbitmap_weight() / hctx->tags->active_queues

Hello Ming,

That approach would result in a severe performance degradation. "active_queues" namely represents the number of queues against which I/O has ever been queued. If e.g. 64 LUNs were associated with a single SCSI host, all 64 LUNs were responding, and the queue depth were also 64, then the approach you proposed would reduce the effective queue depth per LUN from 64 to 1.

Bart.
On Wed, Jul 12, 2017 at 03:39:14PM +0000, Bart Van Assche wrote:
> On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > > What happens with fluid congestion boundaries, with shared tags?
> >
> > The approach in this patch should work, but the threshold may not
> > be accurate in this way, one simple method is to use the average
> > tag weight in EWMA, like this:
> >
> > sbitmap_weight() / hctx->tags->active_queues
>
> Hello Ming,
>
> That approach would result in a severe performance degradation. "active_queues"
> namely represents the number of queues against which I/O ever has been queued.
> If e.g. 64 LUNs would be associated with a single SCSI host and all 64 LUNs are
> responding and if the queue depth would also be 64 then the approach you
> proposed will reduce the effective queue depth per LUN from 64 to 1.

No, this approach does _not_ reduce the effective queue depth; it only stops the queue for a while when the queue is busy enough.

In this case there may be no congestion at all, because blk-mq allows at most queue_depth/active_queues tags to be assigned to each LUN; please see hctx_may_queue(). get_driver_tag() can then return at most one pending tag to the request_queue (LUN).

The algorithm in this patch only starts to work when congestion happens; that is, it only runs when BLK_STS_RESOURCE is returned from .queue_rq(). This approach avoids dispatching requests to a busy queue unnecessarily, so we don't burn CPU for nothing, and merging improves in the meantime.
On Thu, 2017-07-13 at 18:43 +0800, Ming Lei wrote:
> On Wed, Jul 12, 2017 at 03:39:14PM +0000, Bart Van Assche wrote:
> > On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> > > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > > > What happens with fluid congestion boundaries, with shared tags?
> > >
> > > The approach in this patch should work, but the threshold may not
> > > be accurate in this way, one simple method is to use the average
> > > tag weight in EWMA, like this:
> > >
> > > sbitmap_weight() / hctx->tags->active_queues
> >
> > Hello Ming,
> >
> > That approach would result in a severe performance degradation. "active_queues"
> > namely represents the number of queues against which I/O ever has been queued.
> > If e.g. 64 LUNs would be associated with a single SCSI host and all 64 LUNs are
> > responding and if the queue depth would also be 64 then the approach you
> > proposed will reduce the effective queue depth per LUN from 64 to 1.
>
> No, this approach does _not_ reduce the effective queue depth, it only
> stops the queue for a while when the queue is busy enough.
>
> In this case, there may not have congestion because for blk-mq at most allows
> to assign queue_depth/active_queues tags to each LUN, please see hctx_may_queue().

Hello Ming,

hctx_may_queue() severely limits the queue depth if many LUNs are associated with the same SCSI host. I think that this is a performance regression compared to scsi-sq and that this performance regression should be fixed.

Bart.
On Thu, Jul 13, 2017 at 02:56:38PM +0000, Bart Van Assche wrote:
> On Thu, 2017-07-13 at 18:43 +0800, Ming Lei wrote:
> > On Wed, Jul 12, 2017 at 03:39:14PM +0000, Bart Van Assche wrote:
> > > On Wed, 2017-07-12 at 10:30 +0800, Ming Lei wrote:
> > > > On Tue, Jul 11, 2017 at 12:25:16PM -0600, Jens Axboe wrote:
> > > > > What happens with fluid congestion boundaries, with shared tags?
> > > >
> > > > The approach in this patch should work, but the threshold may not
> > > > be accurate in this way, one simple method is to use the average
> > > > tag weight in EWMA, like this:
> > > >
> > > > sbitmap_weight() / hctx->tags->active_queues
> > >
> > > That approach would result in a severe performance degradation. "active_queues"
> > > namely represents the number of queues against which I/O ever has been queued.
> > > If e.g. 64 LUNs would be associated with a single SCSI host and all 64 LUNs are
> > > responding and if the queue depth would also be 64 then the approach you
> > > proposed will reduce the effective queue depth per LUN from 64 to 1.
> >
> > No, this approach does _not_ reduce the effective queue depth, it only
> > stops the queue for a while when the queue is busy enough.
> >
> > In this case, there may not have congestion because for blk-mq at most allows
> > to assign queue_depth/active_queues tags to each LUN, please see hctx_may_queue().
>
> Hello Ming,
>
> hctx_may_queue() severely limits the queue depth if many LUNs are associated
> with the same SCSI host. I think that this is a performance regression
> compared to scsi-sq and that this performance regression should be fixed.

IMO, it is hard to evaluate/compare perf between scsi-mq and scsi-sq:

- how many LUNs do you run IO on concurrently?
- is the perf evaluated on a single LUN or on multiple LUNs?

BTW, active_queues is a runtime variable which counts the queues actually in use.
On Thu, 2017-07-13 at 23:32 +0800, Ming Lei wrote:
> On Thu, Jul 13, 2017 at 02:56:38PM +0000, Bart Van Assche wrote:
> > hctx_may_queue() severely limits the queue depth if many LUNs are associated
> > with the same SCSI host. I think that this is a performance regression
> > compared to scsi-sq and that this performance regression should be fixed.
>
> IMO, it is hard to evaluate/compare perf between scsi-mq vs scsi-sq:
>
> - how many LUNs do you run IO on concurrently?
> - evaluate the perf on single LUN or multi LUN?
>
> BTW, active_queues is a runtime variable which accounts the actual active
> queues in use.

Hello Ming,

What I described can be verified easily by running fio against a single LUN of a SCSI host with which a large number of LUNs are associated.

BTW, something I overlooked in my analysis of the active_queues variable is that it is not only decremented if a queue is destroyed but also if a queue is idle for (block layer request timeout) seconds. So the problem I described will only occur if some software submits I/O requests periodically to all SCSI LUNs with a shorter interval than the I/O timeout. I'm not sure any Linux software does this today. As an example, the time between path checks by the multipathd TUR checker typically exceeds the timeout of a single TUR check.

Bart.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6e0fc80aa151..da50c187c508 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -976,6 +976,18 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
+static void blk_mq_update_req_dispatch_busy(struct blk_mq_hw_ctx *hctx)
+{
+	struct sbitmap_queue *sbq;
+	unsigned depth;
+
+	sbq = &hctx->tags->bitmap_tags;
+	depth = sbitmap_weight(&sbq->sb);
+
+	/* use EWMA to estimate a threshold for detecting congestion */
+	ewma_add(hctx->avg_busy_threshold, depth, 8, 0);
+}
+
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 {
 	struct blk_mq_hw_ctx *hctx;
@@ -1064,6 +1076,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list)
 
 		spin_lock(&hctx->lock);
 		list_splice_init(list, &hctx->dispatch);
+		blk_mq_update_req_dispatch_busy(hctx);
 		spin_unlock(&hctx->lock);
 
 		/*
@@ -1468,6 +1481,7 @@ static void blk_mq_direct_dispatch(struct blk_mq_hw_ctx *hctx,
 {
 	spin_lock(&hctx->lock);
 	list_add(&rq->queuelist, &hctx->dispatch);
+	blk_mq_update_req_dispatch_busy(hctx);
 	spin_unlock(&hctx->lock);
 
 	blk_mq_run_hw_queue(hctx, false);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 60b01c0309bc..c4516d2a2d2c 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -133,4 +133,13 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 	return hctx->nr_ctx && hctx->tags;
 }
 
+/* borrowed from bcache */
+#define ewma_add(ewma, val, weight, factor)	\
+({						\
+	(ewma) *= (weight) - 1;			\
+	(ewma) += (val) << factor;		\
+	(ewma) /= (weight);			\
+	(ewma) >> factor;			\
+})
+
 #endif
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 14542308d25b..8694fb39cd80 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -22,6 +22,8 @@ struct blk_mq_hw_ctx {
 
 	unsigned long		flags;		/* BLK_MQ_F_* flags */
 
+	unsigned long		avg_busy_threshold;
+
 	void			*sched_data;
 	struct request_queue	*queue;
 	struct blk_flush_queue	*fq;
When .queue_rq() returns BLK_STS_RESOURCE(BUSY), we can
consider that there is congestion in either low level
driver or hardware.

This patch uses EWMA to estimate this congestion threshold,
then this threshold can be used to detect/avoid congestion.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 14 ++++++++++++++
 block/blk-mq.h         |  9 +++++++++
 include/linux/blk-mq.h |  2 ++
 3 files changed, 25 insertions(+)