Message ID: 20170825190057.15592-1-bart.vanassche@wdc.com (mailing list archive)
State: New, archived
I still disagree that we should have an indirect function call like this
in the fast path.

All that can be done by clearing or setting a flag on the first call to
->queue_rq or ->queuecommand instead. In NVMe we use RQF_DONTPREP for
that, but for SCSI we probably can't use that given that it has more
meaning for the old request path. But how about just adding a new
RQD_DRV_INITIALIZED or similar flag?
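[A minimal sketch of what the flag-based approach proposed above could look like in a SCSI-style .queue_rq() path. RQF_DRV_INITIALIZED mirrors the RQD_DRV_INITIALIZED name from the proposal and scsi_init_command_once() is a hypothetical helper; neither exists in the kernel.]

	/*
	 * Sketch only: assumes a new request flag (RQF_DRV_INITIALIZED)
	 * and a hypothetical helper that performs the one-time
	 * per-command initialization.
	 */
	static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
					  const struct blk_mq_queue_data *bd)
	{
		struct request *req = bd->rq;

		if (!(req->rq_flags & RQF_DRV_INITIALIZED)) {
			scsi_init_command_once(req);	/* hypothetical helper */
			req->rq_flags |= RQF_DRV_INITIALIZED;
		}

		/* ... normal command preparation and dispatch continue here ... */
		return BLK_STS_OK;
	}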
On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
> I still disagree that we should have an indirect function call like this
> in the fast path.
>
> All that can be done by clearing or setting a flag on the first call to
> ->queue_rq or ->queuecommand instead. In NVMe we use RQF_DONTPREP for
> that, but for SCSI we probably can't use that given that it has more
> meaning for the old request path. But how about just adding a new
> RQD_DRV_INITIALIZED or similar flag?

Hello Christoph,

Sorry but I'm not enthusiastic about the proposal to introduce an
RQD_DRV_INITIALIZED or similar flag. That approach involves an annoying
behavior difference, namely that .initialize_rq_fn() would be called from
inside blk_get_request() for pass-through requests and from inside the prep
function for filesystem requests. Another disadvantage of that approach is
that the block layer core never clears request.atomic_flags completely but
only sets and clears individual flags. The SCSI core would have to follow
that model, and hence code for clearing RQD_DRV_INITIALIZED would have to
be added to all request completion paths in the SCSI core.

Have you noticed that Ming Lei's patch series introduces several new atomic
operations in the hot path? I'm referring here to the BLK_MQ_S_DISPATCH_BUSY
manipulations. Have you noticed that for SCSI drivers these patches introduce
an overhead between 0.1 and 1.0 microseconds per I/O request in the hot path?
I have derived these numbers from the random write SRP performance numbers
as follows: 1/142460 - 1/142990 = 0.026 microseconds. That number has to be
multiplied by the number of I/O requests processed in parallel. The number
of jobs in Ming Lei's test was 64, but that's probably way higher than the
actual I/O parallelism.

Have you noticed that my patch did not add any atomic instructions to the
hot path but only a read of a function pointer that should already be cache
hot? As you know, modern CPUs are good at predicting branches. Are you sure
that my patch will have a measurable performance impact?

Bart.
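[A quick numeric check of the derivation above, as a standalone sketch. The in-flight counts of 4 and 38 are illustrative values chosen to bracket the quoted 0.1-1.0 microsecond range, not measurements.]

	/* Back-of-the-envelope check of the IOPS arithmetic above. */
	#include <stdio.h>

	int main(void)
	{
		double t_with = 1.0 / 142460;	/* seconds per request, with series */
		double t_without = 1.0 / 142990;	/* seconds per request, without */
		double delta_us = (t_with - t_without) * 1e6;

		/* Prints ~0.026 us per request; scaled by the effective
		 * number of requests in flight (somewhere below the 64 fio
		 * jobs) this lands in the 0.1 to 1.0 us range quoted above. */
		printf("delta per request: %.3f us\n", delta_us);
		printf("x 4 in flight:     %.2f us\n", delta_us * 4);
		printf("x 38 in flight:    %.2f us\n", delta_us * 38);
		return 0;
	}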
On Mon, Aug 28, 2017 at 06:31:33PM +0000, Bart Van Assche wrote:
> On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
> > I still disagree that we should have an indirect function call like this
> > in the fast path.
> >
> > All that can be done by clearing or setting a flag on the first call to
> > ->queue_rq or ->queuecommand instead. In NVMe we use RQF_DONTPREP for
> > that, but for SCSI we probably can't use that given that it has more
> > meaning for the old request path. But how about just adding a new
> > RQD_DRV_INITIALIZED or similar flag?
>
> Hello Christoph,
>
> Sorry but I'm not enthusiastic about the proposal to introduce an
> RQD_DRV_INITIALIZED or similar flag. That approach involves an annoying
> behavior difference, namely that .initialize_rq_fn() would be called from
> inside blk_get_request() for pass-through requests and from inside the prep
> function for filesystem requests. Another disadvantage of that approach is
> that the block layer core never clears request.atomic_flags completely but
> only sets and clears individual flags. The SCSI core would have to follow
> that model, and hence code for clearing RQD_DRV_INITIALIZED would have to
> be added to all request completion paths in the SCSI core.
>
> Have you noticed that Ming Lei's patch series introduces several new atomic
> operations in the hot path? I'm referring here to the BLK_MQ_S_DISPATCH_BUSY
> manipulations. Have you noticed that for SCSI drivers these patches
> introduce an overhead between 0.1 and 1.0 microseconds per I/O request in
> the hot path? I have derived these numbers from the random write SRP
> performance numbers as follows: 1/142460 - 1/142990 = 0.026 microseconds.
> That number has to be multiplied by the number of I/O requests processed
> in parallel. The number of jobs in Ming Lei's test was 64, but that's
> probably way higher than the actual I/O parallelism.

Hi Bart,

Did you see a perf regression on SRP with smaller job counts after applying
my patchset V3?

I just ran the test with 16 jobs (the system has 16 CPU cores) instead of 64,
and I don't see a perf regression on SRP for v4.13-rc6+blk-next (2nd column)
vs. v4.13-rc6+blk-next+patch V3 (3rd column):

---------------------------------------
 IOPS(K)   |    NONE    |    NONE
---------------------------------------
 read      |   475.83   |   485.88
---------------------------------------
 randread  |   142.86   |   141.96
---------------------------------------
 write     |   483.9    |   492.39
---------------------------------------
 randwrite |   124.83   |   124.53
---------------------------------------

---------------------------------------
 CPU(%)    |    NONE    |    NONE
---------------------------------------
 read      |   15.43    |   15.81
---------------------------------------
 randread  |    9.87    |    9.75
---------------------------------------
 write     |   17.67    |   18.34
---------------------------------------
 randwrite |   10.96    |   10.56
---------------------------------------

---------------------------------------
 LAT(us)   |    NONE    |    NONE
---------------------------------------
 read      |    2.15    |    2.11
---------------------------------------
 randread  |    7.17    |    7.21
---------------------------------------
 write     |    2.11    |    2.08
---------------------------------------
 randwrite |    8.2     |    8.22
---------------------------------------

---------------------------------------
 MERGE(K)  |    NONE    |    NONE
---------------------------------------
 read      |  8798.59   |  9064.09
---------------------------------------
 randread  |    0.02    |    0.19
---------------------------------------
 write     |  8847.73   |  9102.18
---------------------------------------
 randwrite |    0.03    |    0.13
---------------------------------------
Hi Bart,

this isn't just about performance - it's about understandability of the I/O
path. The legacy request path already has the prep_fn, which is intended for
exactly this sort of prep work, but even that led to a lot of confusion. For
blk-mq we decided not to add it but to leave the driver in control. I really
don't want to move away from that.

The pass-through requests using blk_get_request() are a special case as the
caller allocates the request, stuffs data into it and only then hands control
to the driver, and thus we need some way to initialize the request before
handing control to the driver in this particular special case. If needed we
could actually do that with explicit calls instead of the callback, although
you changed it to save a bit of code.
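[The "explicit calls" alternative mentioned above would presumably look something like this at a pass-through call site. blk_get_request() and scsi_req_init() are real v4.13-era interfaces; whether every call site would look exactly like this is an assumption.]

	/*
	 * Illustrative sketch: initialize a pass-through request with an
	 * explicit call at the allocation site instead of via a callback.
	 */
	struct request *rq;

	rq = blk_get_request(q, REQ_OP_SCSI_IN, GFP_KERNEL);
	if (IS_ERR(rq))
		return PTR_ERR(rq);

	scsi_req_init(rq);	/* explicit driver-specific initialization */

	/* ... fill in CDB and data buffers, then execute the request ... */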
On Tue, 2017-08-29 at 19:16 +0800, Ming Lei wrote:
> Hi Bart,
>
> Did you see a perf regression on SRP with smaller job counts after applying
> my patchset V3?
>
> I just ran the test with 16 jobs (the system has 16 CPU cores) instead of
> 64, and I don't see a perf regression on SRP for v4.13-rc6+blk-next
> (2nd column) vs. v4.13-rc6+blk-next+patch V3 (3rd column):
>
> ---------------------------------------
>  IOPS(K)   |    NONE    |    NONE
> ---------------------------------------
>  read      |   475.83   |   485.88
> ---------------------------------------
>  randread  |   142.86   |   141.96
> ---------------------------------------
>  write     |   483.9    |   492.39
> ---------------------------------------
>  randwrite |   124.83   |   124.53
> ---------------------------------------
> [ ... ]
> ---------------------------------------
>  LAT(us)   |    NONE    |    NONE
> ---------------------------------------
>  read      |    2.15    |    2.11
> ---------------------------------------
>  randread  |    7.17    |    7.21
> ---------------------------------------
>  write     |    2.11    |    2.08
> ---------------------------------------
>  randwrite |    8.2     |    8.22
> ---------------------------------------
> [ ... ]

Hello Ming,

Although I would prefer to see measurement data against an SRP target system
that supports a higher workload (>1M IOPS) and also for a high-end NVMe
drive, I think the above data is sufficient to show that the performance
impact of your patch series is most likely small enough even for high-end
SCSI initiator drivers.

Bart.
On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
> All that can be done by clearing or setting a flag on the first call to
> ->queue_rq or ->queuecommand instead. In NVMe we use RQF_DONTPREP for
> that, but for SCSI we probably can't use that given that it has more
> meaning for the old request path. But how about just adding a new
> RQD_DRV_INITIALIZED or similar flag?

Hello Christoph,

More changes would have to be made to implement the above proposal than just
setting a flag at the start of .queue_rq() / .queuecommand() for all
filesystem requests. The numerous code paths that lead to a request not
being executed immediately, e.g. a SCSI host being busy or a preparation
status != BLKPREP_OK, would have to be inspected, and code would have to be
added to clear the "command initialized" flag to ensure that initialization
occurs shortly before the first time a command is executed.

The choice we have to make is between adding more state information, plus
complicated code for keeping that state information up-to-date, to the SCSI
core, and making a small and easy-to-maintain change to the block layer core
that does not involve any new state information. That's why I'm asking you
to reconsider the patch at the start of this e-mail thread.

Thanks,

Bart.
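[To make the concern above concrete, a sketch of the kind of flag clearing Bart describes. The flag name follows the proposal earlier in the thread; the function is a simplified stand-in for the real SCSI requeue/completion paths, not actual kernel code.]

	/*
	 * Sketch only: every path that hands a request back to the block
	 * layer without completing it (requeue, busy, prep failure) would
	 * need something like this, otherwise the "initialize once" check
	 * in .queue_rq() would be skipped the next time around.
	 */
	static void scsi_unprep_for_requeue(struct request *req)
	{
		/* hypothetical flag from the proposal above */
		req->rq_flags &= ~RQF_DRV_INITIALIZED;

		/* ... existing unprep/requeue work ... */
	}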
On 08/29/2017 02:57 PM, Bart Van Assche wrote:
> On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
>> All that can be done by clearing or setting a flag on the first call to
>> ->queue_rq or ->queuecommand instead. In NVMe we use RQF_DONTPREP for
>> that, but for SCSI we probably can't use that given that it has more
>> meaning for the old request path. But how about just adding a new
>> RQD_DRV_INITIALIZED or similar flag?
>
> Hello Christoph,
>
> More changes would have to be made to implement the above proposal
> than just setting a flag at the start of .queue_rq() / .queuecommand()
> for all filesystem requests. The numerous code paths that lead to a
> request not being executed immediately, e.g. a SCSI host being busy or
> a preparation status != BLKPREP_OK, would have to be inspected, and
> code would have to be added to clear the "command initialized" flag to
> ensure that initialization occurs shortly before the first time a
> command is executed.
>
> The choice we have to make is between adding more state information,
> plus complicated code for keeping that state information up-to-date,
> to the SCSI core, and making a small and easy-to-maintain change to
> the block layer core that does not involve any new state information.
> That's why I'm asking you to reconsider the patch at the start of this
> e-mail thread.

I don't like this addition, and honestly I wasn't a huge fan of adding
->init() hooks to the alloc path when we initially added this earlier this
summer. The fact that we now have to make this more invasive doesn't improve
the situation at all. So fwiw, I too would much rather see an implementation
based on an RQF_ flag instead.
On Tue, Aug 29, 2017 at 08:15:23PM +0000, Bart Van Assche wrote:
> On Tue, 2017-08-29 at 19:16 +0800, Ming Lei wrote:
> > Hi Bart,
> >
> > Did you see a perf regression on SRP with smaller job counts after
> > applying my patchset V3?
> >
> > I just ran the test with 16 jobs (the system has 16 CPU cores) instead
> > of 64, and I don't see a perf regression on SRP for v4.13-rc6+blk-next
> > (2nd column) vs. v4.13-rc6+blk-next+patch V3 (3rd column):
> >
> > ---------------------------------------
> >  IOPS(K)   |    NONE    |    NONE
> > ---------------------------------------
> >  read      |   475.83   |   485.88
> > ---------------------------------------
> >  randread  |   142.86   |   141.96
> > ---------------------------------------
> >  write     |   483.9    |   492.39
> > ---------------------------------------
> >  randwrite |   124.83   |   124.53
> > ---------------------------------------
> > [ ... ]
> > ---------------------------------------
> >  LAT(us)   |    NONE    |    NONE
> > ---------------------------------------
> >  read      |    2.15    |    2.11
> > ---------------------------------------
> >  randread  |    7.17    |    7.21
> > ---------------------------------------
> >  write     |    2.11    |    2.08
> > ---------------------------------------
> >  randwrite |    8.2     |    8.22
> > ---------------------------------------
> > [ ... ]
>
> Hello Ming,
>
> Although I would prefer to see measurement data against an SRP target
> system that supports a higher workload (>1M IOPS) and

Hi Bart,

For such a high workload, I guess .cmd_per_lun often needs to be increased.

> also for a high-end NVMe drive,

My patch won't affect NVMe drives since the NVMe driver doesn't usually
become busy.

> I think the above data is sufficient to show that the performance impact
> of your patch series is most likely small enough even for high-end SCSI
> initiator drivers.

OK.
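[For reference, .cmd_per_lun is a field of struct scsi_host_template; a bump for a high-IOPS setup might look like the following. The field names are real, but the surrounding template and the values shown are purely illustrative.]

	/*
	 * Sketch: raising the per-LUN queue depth so the initiator can keep
	 * enough commands in flight for a >1M IOPS target. Values are
	 * assumptions, not recommendations.
	 */
	static struct scsi_host_template example_srp_template = {
		.name		= "example SRP initiator",
		.queuecommand	= example_queuecommand,	/* hypothetical */
		.can_queue	= 1024,			/* illustrative */
		.cmd_per_lun	= 128,			/* illustrative */
		/* ... remaining fields elided ... */
	};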
diff --git a/block/blk-core.c b/block/blk-core.c
index 78a46794053e..463aa0ee36ce 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -126,6 +126,9 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
 	rq->part = NULL;
+
+	if (q && q->initialize_rq_fn)
+		q->initialize_rq_fn(rq);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
@@ -1420,21 +1423,12 @@ static struct request *blk_old_get_request(struct request_queue *q,
 struct request *blk_get_request(struct request_queue *q, unsigned int op,
 				gfp_t gfp_mask)
 {
-	struct request *req;
-
-	if (q->mq_ops) {
-		req = blk_mq_alloc_request(q, op,
+	if (q->mq_ops)
+		return blk_mq_alloc_request(q, op,
 			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
 				0 : BLK_MQ_REQ_NOWAIT);
-		if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
-			q->mq_ops->initialize_rq_fn(req);
-	} else {
-		req = blk_old_get_request(q, op, gfp_mask);
-		if (!IS_ERR(req) && q->initialize_rq_fn)
-			q->initialize_rq_fn(req);
-	}
-
-	return req;
+	else
+		return blk_old_get_request(q, op, gfp_mask);
 }
 EXPORT_SYMBOL(blk_get_request);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 21f2cff217ce..fdb33f8a2860 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -273,6 +273,7 @@ EXPORT_SYMBOL(blk_mq_can_queue);
 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		unsigned int tag, unsigned int op)
 {
+	struct request_queue *q = data->q;
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
 	struct request *rq = tags->static_rqs[tag];
 
@@ -325,6 +326,9 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->end_io_data = NULL;
 	rq->next_rq = NULL;
 
+	if (q->mq_ops->initialize_rq_fn)
+		q->mq_ops->initialize_rq_fn(rq);
+
 	data->ctx->rq_dispatched[op_is_sync(op)]++;
 	return rq;
 }
Since .initialize_rq_fn() is called from inside blk_get_request(), that
function is only called for pass-through requests and not for filesystem
requests. There is agreement in the SCSI community that the jiffies_at_alloc
and retries members of struct scsi_cmnd should be initialized at request
allocation time instead of when preparing a request. This makes it possible
to preserve the value of these members when requeuing a request. Since the
block layer does not yet allow block drivers to initialize filesystem
requests without overriding request_queue.make_request_fn, move the
.initialize_rq_fn() calls from blk_get_request() into blk_mq_rq_ctx_init()
and blk_rq_init(). See also
https://www.spinics.net/lists/linux-scsi/msg108934.html.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
---
 block/blk-core.c | 20 +++++++-------------
 block/blk-mq.c   |  4 ++++
 2 files changed, 11 insertions(+), 13 deletions(-)
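[For context, the SCSI-side consumer of this hook would look roughly like the following. The fields initialized here follow the commit message's description of jiffies_at_alloc and retries; the exact function body in Bart's companion SCSI series may differ.]

	/*
	 * Sketch of a SCSI-side .initialize_rq_fn() implementation. The
	 * scsi_cmnd fields are real; the function body is an assumption
	 * based on the commit message above.
	 */
	static void scsi_initialize_rq(struct request *rq)
	{
		struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);

		cmd->jiffies_at_alloc = jiffies;	/* allocation timestamp */
		cmd->retries = 0;			/* reset retry counter */
	}

	/*
	 * Per the diff above, the hook would be wired up through
	 * q->mq_ops->initialize_rq_fn on the blk-mq path and
	 * q->initialize_rq_fn on the legacy path, so that
	 * blk_mq_rq_ctx_init() / blk_rq_init() invoke it for every
	 * request, filesystem and pass-through alike.
	 */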