Message ID: 54FCF1B2.8030007@ce.jp.nec.com (mailing list archive)
State: Not Applicable, archived
Delegated to: christophe varoqui
> -----Original Message-----
> From: Junichi Nomura [mailto:j-nomura@ce.jp.nec.com]
> Sent: Sunday, March 08, 2015 8:05 PM
> To: device-mapper development; Mike Snitzer
> Cc: axboe@kernel.dk; jmoyer@redhat.com; Hannes Reinecke; Merla, ShivaKrishna
> Subject: Re: [dm-devel] [PATCH 6/8] dm: don't start current request if it would've merged with the previous
>
> On 03/04/15 09:47, Mike Snitzer wrote:
> > Request-based DM's dm_request_fn() is so fast to pull requests off the
> > queue that steps need to be taken to promote merging by avoiding request
> > processing if it makes sense.
> >
> > If the current request would've merged with the previous request, let the
> > current request stay on the queue longer.
>
> Hi Mike,
>
> Looking at this thread, I think there are 2 different problems mixed.
>
> Firstly, "/dev/skd" is the STEC S1120 block driver, which doesn't have an
> lld_busy function. So back pressure doesn't propagate to the request-based
> dm device, and dm feeds as many requests as possible to the lower driver
> (the "pulling too fast" situation).
> If you still have access to the device, can you try a patch like
> the attached one?
>
> Secondly, for this comment from Merla ShivaKrishna:
>
> > Yes, indeed this is the exact issue we saw at NetApp. While running sequential
> > 4K write I/O with a large thread count, 2 paths yield better performance than
> > 4 paths, and performance drastically drops with 4 paths. The device queue_depth
> > was 32, and with blktrace we could see better I/O merging happening; the average
> > request size was > 8K through iostat. With 4 paths, none of the I/O gets merged
> > and the average request size is always 4K. The scheduler used was noop, as we
> > are using SSD-based storage. We could get I/O merging to happen even with 4
> > paths, but with a lower device queue_depth of 16. Even then, performance was
> > lacking compared to 2 paths.
>
> Have you tried increasing nr_requests of the dm device?
> E.g. setting nr_requests to 256.
>
> 4 paths with queue depth 32 each means there can be 128 I/Os in flight.
> With the default nr_requests of 128, the request queue is almost
> always empty and I/O merging cannot happen.
> Increasing nr_requests of the dm device allows some more requests to be
> queued, so the chance of merging may increase.
> Reducing the lower device queue depth could be another solution. But if
> the depth is too low, you might not be able to keep the optimal speed.
>
Yes, we have tried this as well but it didn't help. Indeed, we tested with a
queue_depth of 16 on each path, with 64 I/Os in flight, and it resulted in the
same issue. We did try reducing the queue_depth with 4 paths, but couldn't
achieve performance comparable to 2 paths. With Mike's patch, we see tremendous
improvement with just a small delay of ~20us with 4 paths. This might vary with
different configurations, but it has certainly shown that a tunable to delay
dispatches for sequential workloads helps a lot.
> ----
> Jun'ichi Nomura, NEC Corporation
>
> [PATCH] skd: Add lld_busy function for request-based stacking driver
>
> diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
> index 1e46eb2..0e8f466 100644
> --- a/drivers/block/skd_main.c
> +++ b/drivers/block/skd_main.c
> @@ -565,6 +565,16 @@ skd_prep_discard_cdb(struct skd_scsi_request *scsi_req,
>  	blk_add_request_payload(req, page, len);
>  }
>
> +static int skd_lld_busy(struct request_queue *q)
> +{
> +	struct skd_device *skdev = q->queuedata;
> +
> +	if (skdev->in_flight >= skdev->cur_max_queue_depth)
> +		return 1;
> +
> +	return 0;
> +}
> +
>  static void skd_request_fn_not_online(struct request_queue *q);
>
>  static void skd_request_fn(struct request_queue *q)
> @@ -4419,6 +4429,8 @@ static int skd_cons_disk(struct skd_device *skdev)
>  	/* set sysfs ptimal_io_size to 8K */
>  	blk_queue_io_opt(q, 8192);
>
> +	/* register feed back function for stacking driver */
> +	blk_queue_lld_busy(q, skd_lld_busy);
> +
>  	/* DISCARD Flag initialization. */
>  	q->limits.discard_granularity = 8192;
>  	q->limits.discard_alignment = 0;
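For context on why the missing callback matters: request-based dm decides whether to dispatch by polling the lower queues through blk_lld_busy(), which simply calls whatever function the low-level driver registered with blk_queue_lld_busy(). A rough sketch of that path, paraphrased from the 3.x block layer and dm-mpath (not the exact mainline code):

/*
 * Paraphrased from block/blk-core.c (3.x era): how a stacking driver
 * polls the lower device's congestion state.
 */
int blk_lld_busy(struct request_queue *q)
{
	if (q->lld_busy_fn)		/* set via blk_queue_lld_busy() */
		return q->lld_busy_fn(q);

	return 0;			/* no callback: never reports busy */
}

/*
 * Sketch of the dm-mpath side: each candidate path's queue is polled;
 * if the paths report busy, dm leaves the request on its own queue,
 * where later bios can still merge into it.
 */
static int pgpath_busy(struct pgpath *pgpath)
{
	struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev);

	return blk_lld_busy(q);
}

Since skd registered no lld_busy function, blk_lld_busy() always returned 0 for it, so dm never saw back pressure; that is the "pulling too fast" situation described above.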
On 03/09/15 12:30, Merla, ShivaKrishna wrote:
>> Secondly, for this comment from Merla ShivaKrishna:
>>
>>> Yes, indeed this is the exact issue we saw at NetApp. While running sequential
>>> 4K write I/O with a large thread count, 2 paths yield better performance than
>>> 4 paths, and performance drastically drops with 4 paths. The device queue_depth
>>> was 32, and with blktrace we could see better I/O merging happening; the average
>>> request size was > 8K through iostat. With 4 paths, none of the I/O gets merged
>>> and the average request size is always 4K. The scheduler used was noop, as we
>>> are using SSD-based storage. We could get I/O merging to happen even with 4
>>> paths, but with a lower device queue_depth of 16. Even then, performance was
>>> lacking compared to 2 paths.
>>
>> Have you tried increasing nr_requests of the dm device?
>> E.g. setting nr_requests to 256.
>>
>> 4 paths with queue depth 32 each means there can be 128 I/Os in flight.
>> With the default nr_requests of 128, the request queue is almost
>> always empty and I/O merging cannot happen.
>> Increasing nr_requests of the dm device allows some more requests to be
>> queued, so the chance of merging may increase.
>> Reducing the lower device queue depth could be another solution. But if
>> the depth is too low, you might not be able to keep the optimal speed.
>>
> Yes, we have tried this as well but it didn't help. Indeed, we tested with a
> queue_depth of 16 on each path, with 64 I/Os in flight, and it resulted in the
> same issue. We did try reducing the queue_depth with 4 paths, but couldn't
> achieve performance comparable to 2 paths. With Mike's patch, we see tremendous
> improvement with just a small delay of ~20us with 4 paths. This might vary with
> different configurations, but it has certainly shown that a tunable to delay
> dispatches for sequential workloads helps a lot.

Hi,

Did you try increasing nr_requests of the dm request queue?
If so, what was the increased value of nr_requests in the case of
device queue_depth 32?
> -----Original Message-----
> From: Junichi Nomura [mailto:j-nomura@ce.jp.nec.com]
> Sent: Monday, March 09, 2015 1:10 AM
> To: Merla, ShivaKrishna
> Cc: device-mapper development; Mike Snitzer; axboe@kernel.dk; jmoyer@redhat.com; Hannes Reinecke
> Subject: Re: [dm-devel] [PATCH 6/8] dm: don't start current request if it would've merged with the previous
>
> On 03/09/15 12:30, Merla, ShivaKrishna wrote:
> >> Secondly, for this comment from Merla ShivaKrishna:
> >>
> >>> Yes, indeed this is the exact issue we saw at NetApp. While running sequential
> >>> 4K write I/O with a large thread count, 2 paths yield better performance than
> >>> 4 paths, and performance drastically drops with 4 paths. The device queue_depth
> >>> was 32, and with blktrace we could see better I/O merging happening; the average
> >>> request size was > 8K through iostat. With 4 paths, none of the I/O gets merged
> >>> and the average request size is always 4K. The scheduler used was noop, as we
> >>> are using SSD-based storage. We could get I/O merging to happen even with 4
> >>> paths, but with a lower device queue_depth of 16. Even then, performance was
> >>> lacking compared to 2 paths.
> >>
> >> Have you tried increasing nr_requests of the dm device?
> >> E.g. setting nr_requests to 256.
> >>
> >> 4 paths with queue depth 32 each means there can be 128 I/Os in flight.
> >> With the default nr_requests of 128, the request queue is almost
> >> always empty and I/O merging cannot happen.
> >> Increasing nr_requests of the dm device allows some more requests to be
> >> queued, so the chance of merging may increase.
> >> Reducing the lower device queue depth could be another solution. But if
> >> the depth is too low, you might not be able to keep the optimal speed.
> >>
> > Yes, we have tried this as well but it didn't help. Indeed, we tested with a
> > queue_depth of 16 on each path, with 64 I/Os in flight, and it resulted in the
> > same issue. We did try reducing the queue_depth with 4 paths, but couldn't
> > achieve performance comparable to 2 paths. With Mike's patch, we see tremendous
> > improvement with just a small delay of ~20us with 4 paths. This might vary with
> > different configurations, but it has certainly shown that a tunable to delay
> > dispatches for sequential workloads helps a lot.
>
> Hi,
>
> Did you try increasing nr_requests of the dm request queue?
> If so, what was the increased value of nr_requests in the case of
> device queue_depth 32?
>
Yes, we tried increasing it to 256; the average merge count certainly
increased a little, but not comparably to Mike's change.

03/09/2015 11:07:54 AM
Device:  rrqm/s    wrqm/s   r/s       w/s  rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdak       0.00      0.00  0.00  21737.00   0.00  101064.00      9.30     11.93   0.55     0.00     0.55   0.04  93.00
sdu        0.00      0.00  0.00  21759.00   0.00  101728.00      9.35     11.55   0.53     0.00     0.53   0.04  93.60
sdm        0.00      0.00  0.00  21669.00   0.00  101168.00      9.34     11.76   0.54     0.00     0.54   0.04  94.00
sdac       0.00      0.00  0.00  21812.00   0.00  101540.00      9.31     11.74   0.54     0.00     0.54   0.04  92.50
dm-6       0.00  14266.00  0.00  86980.00   0.00  405496.00      9.32     48.44   0.56     0.00     0.56   0.01  98.70

With the tunable delay of 20us, here are the results.
03/09/2015 11:08:43 AM
Device:  rrqm/s    wrqm/s   r/s       w/s  rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdak       0.00      0.00  0.00  11740.00   0.00  135344.00     23.06      4.42   0.38     0.00     0.38   0.05  62.60
sdu        0.00      0.00  0.00  11781.00   0.00  140800.00     23.90      3.23   0.27     0.00     0.27   0.05  62.80
sdm        0.00      0.00  0.00  11770.00   0.00  137592.00     23.38      4.53   0.39     0.00     0.39   0.06  65.60
sdac       0.00      0.00  0.00  11664.00   0.00  137976.00     23.66      3.36   0.29     0.00     0.29   0.05  60.80
dm-6       0.00  88446.00  0.00  46937.00   0.00  551684.00     23.51     17.88   0.38     0.00     0.38   0.02  99.30

> --
> Jun'ichi Nomura, NEC Corporation
On 03/10/15 01:10, Merla, ShivaKrishna wrote:
>> Did you try increasing nr_requests of the dm request queue?
>> If so, what was the increased value of nr_requests in the case of
>> device queue_depth 32?
>>
> Yes, we tried increasing it to 256; the average merge count certainly
> increased a little, but not comparably to Mike's change.
>
> 03/09/2015 11:07:54 AM
> Device:  rrqm/s    wrqm/s   r/s       w/s  rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> sdak       0.00      0.00  0.00  21737.00   0.00  101064.00      9.30     11.93   0.55     0.00     0.55   0.04  93.00
> sdu        0.00      0.00  0.00  21759.00   0.00  101728.00      9.35     11.55   0.53     0.00     0.53   0.04  93.60
> sdm        0.00      0.00  0.00  21669.00   0.00  101168.00      9.34     11.76   0.54     0.00     0.54   0.04  94.00
> sdac       0.00      0.00  0.00  21812.00   0.00  101540.00      9.31     11.74   0.54     0.00     0.54   0.04  92.50
> dm-6       0.00  14266.00  0.00  86980.00   0.00  405496.00      9.32     48.44   0.56     0.00     0.56   0.01  98.70
>
> With the tunable delay of 20us, here are the results.
>
> 03/09/2015 11:08:43 AM
> Device:  rrqm/s    wrqm/s   r/s       w/s  rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> sdak       0.00      0.00  0.00  11740.00   0.00  135344.00     23.06      4.42   0.38     0.00     0.38   0.05  62.60
> sdu        0.00      0.00  0.00  11781.00   0.00  140800.00     23.90      3.23   0.27     0.00     0.27   0.05  62.80
> sdm        0.00      0.00  0.00  11770.00   0.00  137592.00     23.38      4.53   0.39     0.00     0.39   0.06  65.60
> sdac       0.00      0.00  0.00  11664.00   0.00  137976.00     23.66      3.36   0.29     0.00     0.29   0.05  60.80
> dm-6       0.00  88446.00  0.00  46937.00   0.00  551684.00     23.51     17.88   0.38     0.00     0.38   0.02  99.30

Oh, I see. Thank you.
So it's not that requests weren't queued for merging, but that the CPUs
could not pile up requests fast enough...

If possible, it would be interesting to see the results with a much
lower device queue_depth, like 4 or 2.
> -----Original Message-----
> From: Junichi Nomura [mailto:j-nomura@ce.jp.nec.com]
> Sent: Monday, March 09, 2015 8:06 PM
> To: Merla, ShivaKrishna
> Cc: device-mapper development; Mike Snitzer; axboe@kernel.dk; jmoyer@redhat.com; Hannes Reinecke
> Subject: Re: [dm-devel] [PATCH 6/8] dm: don't start current request if it would've merged with the previous
>
> On 03/10/15 01:10, Merla, ShivaKrishna wrote:
> >> Did you try increasing nr_requests of the dm request queue?
> >> If so, what was the increased value of nr_requests in the case of
> >> device queue_depth 32?
> >>
> > Yes, we tried increasing it to 256; the average merge count certainly
> > increased a little, but not comparably to Mike's change.
> >
> > 03/09/2015 11:07:54 AM
> > Device:  rrqm/s    wrqm/s   r/s       w/s  rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> > sdak       0.00      0.00  0.00  21737.00   0.00  101064.00      9.30     11.93   0.55     0.00     0.55   0.04  93.00
> > sdu        0.00      0.00  0.00  21759.00   0.00  101728.00      9.35     11.55   0.53     0.00     0.53   0.04  93.60
> > sdm        0.00      0.00  0.00  21669.00   0.00  101168.00      9.34     11.76   0.54     0.00     0.54   0.04  94.00
> > sdac       0.00      0.00  0.00  21812.00   0.00  101540.00      9.31     11.74   0.54     0.00     0.54   0.04  92.50
> > dm-6       0.00  14266.00  0.00  86980.00   0.00  405496.00      9.32     48.44   0.56     0.00     0.56   0.01  98.70
> >
> > With the tunable delay of 20us, here are the results.
> >
> > 03/09/2015 11:08:43 AM
> > Device:  rrqm/s    wrqm/s   r/s       w/s  rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
> > sdak       0.00      0.00  0.00  11740.00   0.00  135344.00     23.06      4.42   0.38     0.00     0.38   0.05  62.60
> > sdu        0.00      0.00  0.00  11781.00   0.00  140800.00     23.90      3.23   0.27     0.00     0.27   0.05  62.80
> > sdm        0.00      0.00  0.00  11770.00   0.00  137592.00     23.38      4.53   0.39     0.00     0.39   0.06  65.60
> > sdac       0.00      0.00  0.00  11664.00   0.00  137976.00     23.66      3.36   0.29     0.00     0.29   0.05  60.80
> > dm-6       0.00  88446.00  0.00  46937.00   0.00  551684.00     23.51     17.88   0.38     0.00     0.38   0.02  99.30
>
> Oh, I see. Thank you.
> So it's not that requests weren't queued for merging, but that the CPUs
> could not pile up requests fast enough...
>
> If possible, it would be interesting to see the results with a much
> lower device queue_depth, like 4 or 2.

I think very low queue_depths will lead multipath_busy() to return 1 even
with a large number of paths, and hence lead to better I/O merging.

queue_depth 2

03/09/2015 08:47:04 PM
Device:  rrqm/s    wrqm/s   r/s       w/s  rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
sdak       0.00      0.00  0.00  10512.00   0.00  106512.00     20.26     12.60   1.20     0.00     1.20   0.10  100.00
sdu        0.00      0.00  0.00  10546.00   0.00  105728.00     20.05     12.92   1.23     0.00     1.23   0.09  100.00
sdm        0.00      0.00  0.00  10518.00   0.00  106108.00     20.18     12.80   1.22     0.00     1.22   0.09   99.90
sdac       0.00      0.00  0.00  10548.00   0.00  106100.00     20.12     13.10   1.24     0.00     1.24   0.09  100.00
dm-6       0.00  62581.00  0.00  42122.00   0.00  424420.00     20.15     53.77   1.28     0.00     1.28   0.02  100.00

queue_depth 4

03/09/2015 08:54:27 PM
Device:  rrqm/s    wrqm/s   r/s       w/s  rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
sdak       0.00      0.00  0.00  15671.00   0.00  109292.00     13.95      8.64   0.55     0.00     0.55   0.06   99.80
sdu        0.00      0.00  0.00  15733.00   0.00  109204.00     13.88      8.34   0.53     0.00     0.53   0.06   99.40
sdm        0.00      0.00  0.00  15779.00   0.00  106788.00     13.54      8.57   0.54     0.00     0.54   0.06   99.50
sdac       0.00      0.00  0.00  15611.00   0.00  109568.00     14.04      8.31   0.53     0.00     0.53   0.06   98.70
dm-6       0.00  44626.00  0.00  62795.00   0.00  434832.00     13.85     36.12   0.58     0.00     0.58   0.02  100.00

> --
> Jun'ichi Nomura, NEC Corporation
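Why low queue_depth changes the picture: dm-mpath reports the whole device busy only when every usable path is busy, and for SCSI paths the midlayer registers its own lld_busy callback that reports busy once the device's in-flight commands reach its queue_depth. A simplified sketch of that all-paths-busy check (hypothetical name multipath_busy_sketch; the real multipath_busy() in dm-mpath.c also handles path-group state and locking):

/*
 * Simplified sketch of dm-mpath's busy logic, not the mainline code:
 * the multipath device counts as busy only if all active paths in the
 * current priority group are busy. With queue_depth 2 each path
 * saturates almost immediately, so this returns 1 and dm holds
 * requests back where they can merge; with queue_depth 32 it rarely
 * fires.
 */
static int multipath_busy_sketch(struct multipath *m)
{
	struct priority_group *pg = m->current_pg;
	struct pgpath *pgpath;

	list_for_each_entry(pgpath, &pg->pgpaths, list) {
		struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev);

		if (pgpath->is_active && !blk_lld_busy(q))
			return 0;	/* one path can still take I/O */
	}

	return 1;	/* every active path is busy: queue at the dm level */
}

That matches the numbers in this message: merging improves at queue_depth 2 and 4 (avgrq-sz around 20 and 14 sectors), but throughput still trails the 20us-delay run, which reached avgrq-sz ~23.5 at far lower per-path utilization.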
On 03/10/15 10:59, Merla, ShivaKrishna wrote:
>>> 03/09/2015 11:07:54 AM
>>> Device:  rrqm/s    wrqm/s   r/s       w/s  rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>>> sdak       0.00      0.00  0.00  21737.00   0.00  101064.00      9.30     11.93   0.55     0.00     0.55   0.04  93.00
>>> sdu        0.00      0.00  0.00  21759.00   0.00  101728.00      9.35     11.55   0.53     0.00     0.53   0.04  93.60
>>> sdm        0.00      0.00  0.00  21669.00   0.00  101168.00      9.34     11.76   0.54     0.00     0.54   0.04  94.00
>>> sdac       0.00      0.00  0.00  21812.00   0.00  101540.00      9.31     11.74   0.54     0.00     0.54   0.04  92.50
>>> dm-6       0.00  14266.00  0.00  86980.00   0.00  405496.00      9.32     48.44   0.56     0.00     0.56   0.01  98.70
>>>
>>> With the tunable delay of 20us, here are the results.
>>>
>>> 03/09/2015 11:08:43 AM
>>> Device:  rrqm/s    wrqm/s   r/s       w/s  rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>>> sdak       0.00      0.00  0.00  11740.00   0.00  135344.00     23.06      4.42   0.38     0.00     0.38   0.05  62.60
>>> sdu        0.00      0.00  0.00  11781.00   0.00  140800.00     23.90      3.23   0.27     0.00     0.27   0.05  62.80
>>> sdm        0.00      0.00  0.00  11770.00   0.00  137592.00     23.38      4.53   0.39     0.00     0.39   0.06  65.60
>>> sdac       0.00      0.00  0.00  11664.00   0.00  137976.00     23.66      3.36   0.29     0.00     0.29   0.05  60.80
>>> dm-6       0.00  88446.00  0.00  46937.00   0.00  551684.00     23.51     17.88   0.38     0.00     0.38   0.02  99.30
>>
>> Oh, I see. Thank you.
>> So it's not that requests weren't queued for merging, but that the CPUs
>> could not pile up requests fast enough...
>>
>> If possible, it would be interesting to see the results with a much
>> lower device queue_depth, like 4 or 2.
> I think very low queue_depths will lead multipath_busy() to return 1 even
> with a large number of paths, and hence lead to better I/O merging.

Yes. But it didn't help very much...

> queue_depth 2
>
> 03/09/2015 08:47:04 PM
> Device:  rrqm/s    wrqm/s   r/s       w/s  rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
> sdak       0.00      0.00  0.00  10512.00   0.00  106512.00     20.26     12.60   1.20     0.00     1.20   0.10  100.00
> sdu        0.00      0.00  0.00  10546.00   0.00  105728.00     20.05     12.92   1.23     0.00     1.23   0.09  100.00
> sdm        0.00      0.00  0.00  10518.00   0.00  106108.00     20.18     12.80   1.22     0.00     1.22   0.09   99.90
> sdac       0.00      0.00  0.00  10548.00   0.00  106100.00     20.12     13.10   1.24     0.00     1.24   0.09  100.00
> dm-6       0.00  62581.00  0.00  42122.00   0.00  424420.00     20.15     53.77   1.28     0.00     1.28   0.02  100.00
>
> queue_depth 4
>
> 03/09/2015 08:54:27 PM
> Device:  rrqm/s    wrqm/s   r/s       w/s  rkB/s      wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
> sdak       0.00      0.00  0.00  15671.00   0.00  109292.00     13.95      8.64   0.55     0.00     0.55   0.06   99.80
> sdu        0.00      0.00  0.00  15733.00   0.00  109204.00     13.88      8.34   0.53     0.00     0.53   0.06   99.40
> sdm        0.00      0.00  0.00  15779.00   0.00  106788.00     13.54      8.57   0.54     0.00     0.54   0.06   99.50
> sdac       0.00      0.00  0.00  15611.00   0.00  109568.00     14.04      8.31   0.53     0.00     0.53   0.06   98.70
> dm-6       0.00  44626.00  0.00  62795.00   0.00  434832.00     13.85     36.12   0.58     0.00     0.58   0.02  100.00
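For reference, the heuristic being tested in these runs: the posted dm patch records where the last dispatched request ended and, while a small tunable window is open (~20us here), leaves a newly peeked request on the dm queue if it starts exactly there in the same direction, on the theory that the next bio would have merged into it. A paraphrased sketch; field and helper names follow the posted series but may differ in detail:

/*
 * Paraphrased sketch of the patch under test (drivers/md/dm.c).
 * md->seq_rq_merge_deadline_usecs is the tunable delay (e.g. 20).
 */
static bool dm_request_peeked_before_merge_deadline(struct mapped_device *md)
{
	ktime_t deadline;

	if (!md->seq_rq_merge_deadline_usecs)
		return false;		/* feature disabled */

	deadline = ktime_add_ns(md->last_rq_start_time,
			(u64)md->seq_rq_merge_deadline_usecs * NSEC_PER_USEC);
	return ktime_before(ktime_get(), deadline);
}

/*
 * In dm_request_fn(), before starting the peeked request rq: if it is
 * sequential with the previous dispatch and we are still inside the
 * merge window, requeue it briefly instead of starting it.
 */
	if (dm_request_peeked_before_merge_deadline(md) &&
	    md_in_flight(md) && rq->bio && rq->bio->bi_vcnt == 1 &&
	    md->last_rq_pos == blk_rq_pos(rq) &&
	    md->last_rq_rw == rq_data_dir(rq))
		goto delay_and_out;	/* let the next bio merge with rq */

This is what the iostat runs show: with the 20us window, avgrq-sz roughly doubles (9.3 to 23.5 sectors) while per-path utilization drops from ~93% to ~62%.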