
blk-throttle: ignore discard request size

Message ID 67ffcf14c2d15622b84c60a493b590dd81a07f51.1503068984.git.shli@fb.com (mailing list archive)
State New, archived

Commit Message

Shaohua Li Aug. 18, 2017, 3:13 p.m. UTC
A discard request is usually very big and easily uses up all of a cgroup's
bandwidth budget. The discard request size doesn't really reflect the amount
of data written, so it doesn't make sense to account it against the bandwidth
budget. This patch ignores discard request size. It does still make sense to
account discard requests against the iops budget, though.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 block/blk-throttle.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)
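To make the motivation concrete, here is a rough userspace model of the wait
time tg_with_in_bps_limit() (modified below) computes when a bio overshoots
the bps budget. The names and the MODEL_HZ constant are ours, not kernel code;
this is a sketch assuming HZ == 1000:

```c
#include <stdint.h>

/* Simplified stand-in for the wait calculation in tg_with_in_bps_limit():
 * if the bio overshoots the bytes allowed so far in the slice, the group
 * waits roughly extra_bytes * HZ / bps_limit jiffies. */
#define MODEL_HZ 1000ULL

static uint64_t throttle_wait_jiffies(uint64_t bytes_disp, uint64_t bio_size,
				      uint64_t bytes_allowed, uint64_t bps_limit)
{
	if (bytes_disp + bio_size <= bytes_allowed)
		return 0;	/* within budget: dispatch immediately */
	return (bytes_disp + bio_size - bytes_allowed) * MODEL_HZ / bps_limit;
}
```

With a 1 MiB/s limit and a fresh slice, a single 8 MiB discard yields a
7000-jiffy (7 s) wait, i.e. one discard monopolizes the whole budget, which is
the behavior the patch avoids.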

Comments

Jens Axboe Aug. 18, 2017, 3:35 p.m. UTC | #1
On 08/18/2017 09:13 AM, Shaohua Li wrote:
> discard request usually is very big and easily use all bandwidth budget
> of a cgroup. discard request size doesn't really mean the size of data
> written, so it doesn't make sense to account it into bandwidth budget.
> This patch ignores discard requests size. It makes sense to account
> discard request into iops budget though.

Some (most) devices do touch media for a discard operation, but the
cost tends to be fairly constant and independent of discard size.
Would it make sense to just treat it as a constant cost? Zero
cost seems wrong.
Shaohua Li Aug. 18, 2017, 4:28 p.m. UTC | #2
On Fri, Aug 18, 2017 at 09:35:01AM -0600, Jens Axboe wrote:
> On 08/18/2017 09:13 AM, Shaohua Li wrote:
> > discard request usually is very big and easily use all bandwidth budget
> > of a cgroup. discard request size doesn't really mean the size of data
> > written, so it doesn't make sense to account it into bandwidth budget.
> > This patch ignores discard requests size. It makes sense to account
> > discard request into iops budget though.
> 
> Some (most) devices do touch media for a discard operation, but the
> cost tends to be fairly constant and independent of discard size.
> Would it make sense to just treat it as a constant cost? Zero
> cost seems wrong.

It would be hard to find the right constant cost. Would this make sense?

min_t(unsigned int, bio->bi_iter.bi_size, queue_max_sectors(q) << 9)
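In the helper the patch below introduces, the capped accounting Shaohua floats
here would look roughly like this. This is a userspace sketch: struct bio_stub
and the max_sectors parameter stand in for struct bio and queue_max_sectors(q):

```c
#include <stdint.h>

/* Sketch of the capped variant: charge the bio's actual size, but never
 * more than one max-sized request's worth of bytes. */
struct bio_stub {
	unsigned int bi_size;	/* stand-in for bio->bi_iter.bi_size */
};

static unsigned int capped_bio_data_size(const struct bio_stub *bio,
					 unsigned int max_sectors)
{
	unsigned int cap = max_sectors << 9;	/* sectors to bytes */
	return bio->bi_size < cap ? bio->bi_size : cap;
}
```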
Jens Axboe Aug. 18, 2017, 7:06 p.m. UTC | #3
On 08/18/2017 10:28 AM, Shaohua Li wrote:
> On Fri, Aug 18, 2017 at 09:35:01AM -0600, Jens Axboe wrote:
>> On 08/18/2017 09:13 AM, Shaohua Li wrote:
>>> discard request usually is very big and easily use all bandwidth budget
>>> of a cgroup. discard request size doesn't really mean the size of data
>>> written, so it doesn't make sense to account it into bandwidth budget.
>>> This patch ignores discard requests size. It makes sense to account
>>> discard request into iops budget though.
>>
>> Some (most) devices do touch media for a discard operation, but the
>> cost tends to be fairly constant and independent of discard size.
>> Would it make sense to just treat it as a constant cost? Zero
>> cost seems wrong.
> 
> that would be hard to find the cost. Would this make sense?
> 
> min_t(unsigned int, bio->bi_iter.bi_size, queue_max_sectors(q) << 9)

It's all going to be approximations, for sure; unfortunately it isn't
an exact science. Why not just use a constant small value? If we assume
that a 4k and 8M discard end up writing roughly the same to media, then
it would follow that just using a smaller constant value (regardless of
actual discard command size) would be useful.
Shaohua Li Aug. 18, 2017, 7:12 p.m. UTC | #4
On Fri, Aug 18, 2017 at 01:06:46PM -0600, Jens Axboe wrote:
> On 08/18/2017 10:28 AM, Shaohua Li wrote:
> > On Fri, Aug 18, 2017 at 09:35:01AM -0600, Jens Axboe wrote:
> >> On 08/18/2017 09:13 AM, Shaohua Li wrote:
> >>> discard request usually is very big and easily use all bandwidth budget
> >>> of a cgroup. discard request size doesn't really mean the size of data
> >>> written, so it doesn't make sense to account it into bandwidth budget.
> >>> This patch ignores discard requests size. It makes sense to account
> >>> discard request into iops budget though.
> >>
> >> Some (most) devices do touch media for a discard operation, but the
> >> cost tends to be fairly constant and independent of discard size.
> >> Would it make sense to just treat it as a constant cost? Zero
> >> cost seems wrong.
> > 
> > that would be hard to find the cost. Would this make sense?
> > 
> > min_t(unsigned int, bio->bi_iter.bi_size, queue_max_sectors(q) << 9)
> 
> It's all going to be approximations, for sure, unfortunately it isn't
> an exact science. Why not just use a constant small value? If we assume
> that a 4k and 8M discard end up writing roughly the same to media, then
> it would follow that just using a smaller constant value (regardless of
> actual discard command size) would be useful.

Sounds good. What number do you suggest? queue_max_sectors or a
random number?

Thanks,
Shaohua
Jens Axboe Aug. 18, 2017, 7:15 p.m. UTC | #5
On 08/18/2017 01:12 PM, Shaohua Li wrote:
> On Fri, Aug 18, 2017 at 01:06:46PM -0600, Jens Axboe wrote:
>> On 08/18/2017 10:28 AM, Shaohua Li wrote:
>>> On Fri, Aug 18, 2017 at 09:35:01AM -0600, Jens Axboe wrote:
>>>> On 08/18/2017 09:13 AM, Shaohua Li wrote:
>>>>> discard request usually is very big and easily use all bandwidth budget
>>>>> of a cgroup. discard request size doesn't really mean the size of data
>>>>> written, so it doesn't make sense to account it into bandwidth budget.
>>>>> This patch ignores discard requests size. It makes sense to account
>>>>> discard request into iops budget though.
>>>>
>>>> Some (most) devices do touch media for a discard operation, but the
>>>> cost tends to be fairly constant and independent of discard size.
>>>> Would it make sense to just treat it as a constant cost? Zero
>>>> cost seems wrong.
>>>
>>> that would be hard to find the cost. Would this make sense?
>>>
>>> min_t(unsigned int, bio->bi_iter.bi_size, queue_max_sectors(q) << 9)
>>
>> It's all going to be approximations, for sure, unfortunately it isn't
>> an exact science. Why not just use a constant small value? If we assume
>> that a 4k and 8M discard end up writing roughly the same to media, then
>> it would follow that just using a smaller constant value (regardless of
>> actual discard command size) would be useful.
> 
> Sounds good. what number do you suggest? queue_max_sectors or a
> random number?

Not sure why you want to go that large? Isn't the idea to throttle on
actual device bandwidth used? In which case a much smaller number should
be a lot closer to reality, say like 64 bytes per discard, regardless
of actual size. That still gives you some throttling instead of just
ignoring it, but at a more reasonable rate.
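The effect of a small constant charge is easy to bound: if each discard is
charged a fixed number of bytes, the group's discard rate is capped at
bps_limit / charge per second. A back-of-the-envelope helper (our own naming,
not kernel code):

```c
#include <stdint.h>

/* Upper bound on discards per second when each discard is charged a
 * constant number of bytes against the group's bps limit. */
static uint64_t max_discards_per_sec(uint64_t bps_limit, uint64_t charge_bytes)
{
	return bps_limit / charge_bytes;
}
```

At a 1 MiB/s limit, a 64 B charge still allows 16384 discards/s, while the
512 B value settled on below allows 2048/s.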
Shaohua Li Aug. 18, 2017, 7:19 p.m. UTC | #6
On Fri, Aug 18, 2017 at 01:15:15PM -0600, Jens Axboe wrote:
> On 08/18/2017 01:12 PM, Shaohua Li wrote:
> > On Fri, Aug 18, 2017 at 01:06:46PM -0600, Jens Axboe wrote:
> >> On 08/18/2017 10:28 AM, Shaohua Li wrote:
> >>> On Fri, Aug 18, 2017 at 09:35:01AM -0600, Jens Axboe wrote:
> >>>> On 08/18/2017 09:13 AM, Shaohua Li wrote:
> >>>>> discard request usually is very big and easily use all bandwidth budget
> >>>>> of a cgroup. discard request size doesn't really mean the size of data
> >>>>> written, so it doesn't make sense to account it into bandwidth budget.
> >>>>> This patch ignores discard requests size. It makes sense to account
> >>>>> discard request into iops budget though.
> >>>>
> >>>> Some (most) devices do touch media for a discard operation, but the
> >>>> cost tends to be fairly constant and independent of discard size.
> >>>> Would it make sense to just treat it as a constant cost? Zero
> >>>> cost seems wrong.
> >>>
> >>> that would be hard to find the cost. Would this make sense?
> >>>
> >>> min_t(unsigned int, bio->bi_iter.bi_size, queue_max_sectors(q) << 9)
> >>
> >> It's all going to be approximations, for sure, unfortunately it isn't
> >> an exact science. Why not just use a constant small value? If we assume
> >> that a 4k and 8M discard end up writing roughly the same to media, then
> >> it would follow that just using a smaller constant value (regardless of
> >> actual discard command size) would be useful.
> > 
> > Sounds good. what number do you suggest? queue_max_sectors or a
> > random number?
> 
> Not sure why you want to go that large? Isn't the idea to throttle on
> actual device bandwidth used? In which case a much smaller number should
> be a lot closer to reality, say like 64 bytes per discard, regardless
> of actual size. That still gives you some throttling instead of just
> ignoring it, but at a more reasonable rate.

Hmm, my guess is that discard is more costly than a normal write in some
drivers, but that's just a guess. I'll make it 512B then to make sure nothing
blows up.

Thanks,
Shaohua
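With the 512 B constant agreed on above, throtl_bio_data_size() would end up
looking roughly like this (a sketch of the discussed follow-up; the v1 patch
below still returns 0, and the stub struct replaces the real struct bio):

```c
#include <stdbool.h>

/* Sketch of throtl_bio_data_size() after the change discussed above:
 * charge discards a constant 512 bytes instead of their full size. */
struct bio_stub {
	unsigned int bi_size;	/* stand-in for bio->bi_iter.bi_size */
	bool is_discard;	/* stand-in for bio_op(bio) == REQ_OP_DISCARD */
};

static unsigned int throtl_bio_data_size_v2(const struct bio_stub *bio)
{
	if (bio->is_discard)
		return 512;
	return bio->bi_size;
}
```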

Patch

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 6a4c4c4..f80acc1 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -380,6 +380,13 @@  static unsigned int tg_iops_limit(struct throtl_grp *tg, int rw)
 	}								\
 } while (0)
 
+static inline unsigned int throtl_bio_data_size(struct bio *bio)
+{
+	if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
+		return 0;
+	return bio->bi_iter.bi_size;
+}
+
 static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg)
 {
 	INIT_LIST_HEAD(&qn->node);
@@ -932,6 +939,7 @@  static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 	bool rw = bio_data_dir(bio);
 	u64 bytes_allowed, extra_bytes, tmp;
 	unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
+	unsigned int bio_size = throtl_bio_data_size(bio);
 
 	jiffy_elapsed = jiffy_elapsed_rnd = jiffies - tg->slice_start[rw];
 
@@ -945,14 +953,14 @@  static bool tg_with_in_bps_limit(struct throtl_grp *tg, struct bio *bio,
 	do_div(tmp, HZ);
 	bytes_allowed = tmp;
 
-	if (tg->bytes_disp[rw] + bio->bi_iter.bi_size <= bytes_allowed) {
+	if (tg->bytes_disp[rw] + bio_size <= bytes_allowed) {
 		if (wait)
 			*wait = 0;
 		return true;
 	}
 
 	/* Calc approx time to dispatch */
-	extra_bytes = tg->bytes_disp[rw] + bio->bi_iter.bi_size - bytes_allowed;
+	extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
 	jiffy_wait = div64_u64(extra_bytes * HZ, tg_bps_limit(tg, rw));
 
 	if (!jiffy_wait)
@@ -1032,11 +1040,12 @@  static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 static void throtl_charge_bio(struct throtl_grp *tg, struct bio *bio)
 {
 	bool rw = bio_data_dir(bio);
+	unsigned int bio_size = throtl_bio_data_size(bio);
 
 	/* Charge the bio to the group */
-	tg->bytes_disp[rw] += bio->bi_iter.bi_size;
+	tg->bytes_disp[rw] += bio_size;
 	tg->io_disp[rw]++;
-	tg->last_bytes_disp[rw] += bio->bi_iter.bi_size;
+	tg->last_bytes_disp[rw] += bio_size;
 	tg->last_io_disp[rw]++;
 
 	/*