
block: Call .initialize_rq_fn() also for filesystem requests

Message ID 20170825190057.15592-1-bart.vanassche@wdc.com (mailing list archive)
State New, archived

Commit Message

Bart Van Assche Aug. 25, 2017, 7 p.m. UTC
Since .initialize_rq_fn() is only called from inside blk_get_request(),
it is invoked for pass-through requests but not for filesystem
requests. There is agreement in the SCSI community that the
jiffies_at_alloc and retries members of struct scsi_cmnd should be
initialized at request allocation time instead of when preparing a
request. This will make it possible to preserve the value of these
members when a request is requeued. Since the block layer does not
yet allow block drivers to initialize filesystem requests without
overriding request_queue.make_request_fn, move the
.initialize_rq_fn() calls from blk_get_request() into
blk_mq_rq_ctx_init() and blk_rq_init().

See also https://www.spinics.net/lists/linux-scsi/msg108934.html.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
---
 block/blk-core.c | 20 +++++++-------------
 block/blk-mq.c   |  4 ++++
 2 files changed, 11 insertions(+), 13 deletions(-)
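
A minimal sketch of the allocation-time initializer this change is meant to
enable on the SCSI side; the helper name and its body are assumptions derived
from the jiffies_at_alloc/retries discussion in the commit message, not code
from this series:

/*
 * Sketch only: an .initialize_rq_fn()-style callback that sets the scsi_cmnd
 * members named in the commit message at request allocation time, so that
 * their values are preserved across requeues.
 */
static void example_scsi_initialize_rq(struct request *rq)
{
        struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);

        cmd->jiffies_at_alloc = jiffies;        /* must survive a later requeue */
        cmd->retries = 0;
}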

Comments

Christoph Hellwig Aug. 28, 2017, 8:10 a.m. UTC | #1
I still disagree that we should have an indirect function call like this
in the fast path.

All that can be done by clearing or setting a flag on the first call to
->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
that, but for SCSI we probably can't use that given that it has more
meaning for the old request path.  But how about just adding a new
RQD_DRV_INITIALIZED or similar flag?
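
A minimal sketch of the alternative Christoph describes, assuming a new
request flag; RQF_DRV_INITIALIZED does not exist and stands in for the
RQD_DRV_INITIALIZED flag proposed above, so treat this purely as an
illustration:

/*
 * Hypothetical flag-based approach: perform the one-time initialization on
 * the first ->queue_rq()/->queuecommand() call instead of at allocation time.
 */
static blk_status_t example_scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
                                          const struct blk_mq_queue_data *bd)
{
        struct request *req = bd->rq;
        struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);

        if (!(req->rq_flags & RQF_DRV_INITIALIZED)) {
                cmd->jiffies_at_alloc = jiffies;        /* survives later requeues */
                cmd->retries = 0;
                req->rq_flags |= RQF_DRV_INITIALIZED;
        }

        /* ... the driver's normal prep and dispatch would follow here ... */
        return BLK_STS_OK;
}
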
Bart Van Assche Aug. 28, 2017, 6:31 p.m. UTC | #2
On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
> I still disagree that we should have an indirect function call like this
> in the fast path.
> 
> All that can be done by clearing or setting a flag on the first call to
> ->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
> that, but for SCSI we probably can't use that given that it has more
> meaning for the old request path.  But how about just adding a new
> RQD_DRV_INITIALIZED or similar flag?

Hello Christoph,

Sorry but I'm not enthusiastic about the proposal to introduce a
RQD_DRV_INITIALIZED or similar flag. That approach involves an annoying
behavior difference, namely that .initialize_rq_fn() would be called from
inside blk_get_request() for pass-through requests and from inside the prep
function for filesystem requests. Another disadvantage of that approach is
that the block layer core never clears request.atomic_flags completely but
only sets and clears individual flags. The SCSI core would have to follow
that model and hence code for clearing RQD_DRV_INITIALIZED would have to be
added to all request completion paths in the SCSI core.

Have you noticed that Ming Lei's patch series introduces several new atomic
operations in the hot path? I'm referring here to the BLK_MQ_S_DISPATCH_BUSY
manipulations. Have you noticed that for SCSI drivers these patches introduce
an overhead between 0.1 and 1.0 microseconds per I/O request in the hot path?
I have derived these numbers from the random write SRP performance numbers
as follows: 1/142460 s - 1/142990 s ≈ 0.026 microseconds. That number has to be
multiplied by the number of I/O requests processed in parallel. The number
of jobs in Ming Lei's test was 64, but that's probably way higher than the
actual I/O parallelism.

Have you noticed that my patch did not add any atomic instructions to the hot
path but only a read of a function pointer that should already be cache hot?
As you know modern CPUs are good at predicting branches. Are you sure that my
patch will have a measurable performance impact?

Bart.
Ming Lei Aug. 29, 2017, 11:16 a.m. UTC | #3
On Mon, Aug 28, 2017 at 06:31:33PM +0000, Bart Van Assche wrote:
> On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
> > I still disagree that we should have an indirect function call like this
> > in the fast path.
> > 
> > All that can be done by clearing or setting a flag on the first call to
> > ->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
> > that, but for SCSI we probably can't use that given that it has more
> > meaning for the old request path.  But how about just adding a new
> > RQD_DRV_INITIALIZED or similar flag?
> 
> Hello Christoph,
> 
> Sorry but I'm not enthusiastic about the proposal to introduce a
> RQD_DRV_INITIALIZED or similar flag. That approach involves an annoying
> behavior difference, namely that .initialize_rq_fn() would be called from
> inside blk_get_request() for pass-through requests and from inside the prep
> function for filesystem requests. Another disadvantage of that approach is
> that the block layer core never clears request.atomic_flags completely but
> only sets and clears individual flags. The SCSI core would have to follow
> that model and hence code for clearing RQD_DRV_INITIALIZED would have to be
> added to all request completion paths in the SCSI core.
> 
> Have you noticed that Ming Lei's patch series introduces several new atomic
> operations in the hot path? I'm referring here to the BLK_MQ_S_DISPATCH_BUSY
> manipulations. Have you noticed that for SCSI drivers these patches introduce
> an overhead between 0.1 and 1.0 microseconds per I/O request in the hot path?
> I have derived these numbers from the random write SRP performance numbers
> as follows: 1/142460 s - 1/142990 s ≈ 0.026 microseconds. That number has to be
> multiplied by the number of I/O requests processed in parallel. The number
> of jobs in Ming Lei's test was 64, but that's probably way higher than the
> actual I/O parallelism.

Hi Bart,

Did you see a perf regression on SRP with smaller job counts after applying
my patchset V3?

I just ran the test with 16 jobs (the system has 16 CPU cores) instead of 64,
and I see no perf regression on SRP for v4.13-rc6+blk-next (2nd column) vs.
v4.13-rc6+blk-next+patch V3 (3rd column):


---------------------------------------
 IOPS(K)  |    NONE     |    NONE
---------------------------------------
read      |      475.83 |      485.88
---------------------------------------
randread  |      142.86 |      141.96
---------------------------------------
write     |       483.9 |      492.39
---------------------------------------
randwrite |      124.83 |      124.53
---------------------------------------



---------------------------------------
 CPU(%)   |    NONE     |    NONE
---------------------------------------
read      |       15.43 |       15.81
---------------------------------------
randread  |        9.87 |        9.75
---------------------------------------
write     |       17.67 |       18.34
---------------------------------------
randwrite |       10.96 |       10.56
---------------------------------------



---------------------------------------
 LAT(us)  |    NONE     |    NONE
---------------------------------------
read      |        2.15 |        2.11
---------------------------------------
randread  |        7.17 |        7.21
---------------------------------------
write     |        2.11 |        2.08
---------------------------------------
randwrite |         8.2 |        8.22
---------------------------------------



---------------------------------------
MERGE(K)  |    NONE     |    NONE
---------------------------------------
read      |     8798.59 |     9064.09
---------------------------------------
randread  |        0.02 |        0.19
---------------------------------------
write     |     8847.73 |     9102.18
---------------------------------------
randwrite |        0.03 |        0.13
---------------------------------------
Christoph Hellwig Aug. 29, 2017, 1:12 p.m. UTC | #4
Hi Bart,

this isn't just about performance - it's about understandability of
the I/O path.

The legacy request path already has the prep_fn, which is intended
for exactly this sort of prep work, but even that led to a lot
of confusion.  For blk-mq we decided not to add it but to leave the
driver in control.  I really don't want to move away from that.

The passthrough requests using blk_get_request are a special case:
the caller allocates the requests, stuffs data into them and only
then hands control to the driver, and thus we need some way to
initialize the request before handing control to the driver in
this particular special case.  And if needed we could actually
do that with explicit calls instead of the callback, although you
changed it to save a bit of code.
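
A sketch of what the "explicit calls instead of the callback" variant could
look like from a pass-through submitter's point of view;
my_driver_init_request() is a made-up helper standing in for whatever
explicit initialization the driver would export, so this is an illustration
of the idea, not code from this thread:

/*
 * Explicit-call alternative: the submitter of a pass-through request
 * initializes it right after allocation, instead of blk_get_request()
 * going through an indirect ->initialize_rq_fn() pointer.
 */
static int example_submit_pt(struct request_queue *q)
{
        struct request *rq;

        rq = blk_get_request(q, REQ_OP_SCSI_IN, GFP_KERNEL);
        if (IS_ERR(rq))
                return PTR_ERR(rq);

        my_driver_init_request(rq);     /* explicit, caller-side initialization */
        /* ... fill in the pass-through payload and execute the request ... */
        blk_put_request(rq);
        return 0;
}
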
Bart Van Assche Aug. 29, 2017, 8:15 p.m. UTC | #5
On Tue, 2017-08-29 at 19:16 +0800, Ming Lei wrote:
> Hi Bart,
> 
> Did you see a perf regression on SRP with smaller job counts after applying
> my patchset V3?
> 
> I just ran the test with 16 jobs (the system has 16 CPU cores) instead of 64,
> and I see no perf regression on SRP for v4.13-rc6+blk-next (2nd column) vs.
> v4.13-rc6+blk-next+patch V3 (3rd column):
> 
> ---------------------------------------
>  IOPS(K)  |    NONE     |    NONE
> ---------------------------------------
> read      |      475.83 |      485.88
> ---------------------------------------
> randread  |      142.86 |      141.96
> ---------------------------------------
> write     |       483.9 |      492.39
> ---------------------------------------
> randwrite |      124.83 |      124.53
> ---------------------------------------
> [ ... ]
> ---------------------------------------
>  LAT(us)  |    NONE     |    NONE
> ---------------------------------------
> read      |        2.15 |        2.11
> ---------------------------------------
> randread  |        7.17 |        7.21
> ---------------------------------------
> write     |        2.11 |        2.08
> ---------------------------------------
> randwrite |         8.2 |        8.22
> ---------------------------------------
> [ ... ]

Hello Ming,

Although I would prefer to see measurement data against an SRP target system
that supports a higher workload (>1M IOPS) and also for a high-end NVMe drive,
I think the above data is sufficient to show that the performance impact of
your patch series is most likely small enough even for high-end SCSI initiator
drivers.

Bart.
Bart Van Assche Aug. 29, 2017, 8:57 p.m. UTC | #6
On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
> All that can be done by clearing or setting a flag on the first call to
> ->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
> that, but for SCSI we probably can't use that given that it has more
> meaning for the old request path.  But how about just adding a new
> RQD_DRV_INITIALIZED or similar flag?

Hello Christoph,

More changes would have to be made to implement the above proposal than just
setting a flag at the start of .queue_rq() / .queuecommand() for all
filesystem requests. The numerous code paths that lead to a request not being
executed immediately, e.g. a SCSI host being busy or a preparation status !=
BLKPREP_OK, would have to be inspected and code would have to be added to clear
the "command initialized" flag to ensure that initialization occurs shortly
before the first time a command is executed.
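
A rough sketch of the extra bookkeeping being described; the function name is
invented and RQF_DRV_INITIALIZED is the same hypothetical flag as above, so
this only illustrates the shape of the change, not actual SCSI core code:

/*
 * With a "driver initialized" flag, every path on which a command is
 * completed, requeued early or otherwise recycled would additionally have
 * to clear the flag so that the next use of that request runs the one-time
 * initialization again.
 */
static void example_scsi_end_request(struct request *req)
{
        req->rq_flags &= ~RQF_DRV_INITIALIZED;  /* one of several such places */
        /* ... existing completion handling continues here ... */
}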

The choice we have to make is between adding more state information to the SCSI
core, plus complicated code for keeping that state information up-to-date, or
making a small and easy-to-maintain change to the block layer core that does not
involve any new state information. That's why I'm asking you to reconsider the
patch at the start of this e-mail thread.

Thanks,

Bart.
Jens Axboe Aug. 29, 2017, 9:24 p.m. UTC | #7
On 08/29/2017 02:57 PM, Bart Van Assche wrote:
> On Mon, 2017-08-28 at 10:10 +0200, Christoph Hellwig wrote:
>> All that can be done by clearing or setting a flag on the first call to
>> ->queue_rq or ->queuecommand instead.  In NVMe we use RQF_DONTPREP for
>> that, but for SCSI we probably can't use that given that it has more
>> meaning for the old request path.  But how about just adding a new
>> RQD_DRV_INITIALIZED or similar flag?
> 
> Hello Christoph,
> 
> More changes would have to be made to implement the above proposal
> than just setting a flag at the start of .queue_rq() / .queuecommand()
> for all filesystem requests. The numerous code paths that lead to a
> request not being executed immediately, e.g. a SCSI host being busy or
> a preparation status != BLKPREP_OK, would have to be inspected and
> code would have to be added to clear the "command initialized" flag to
> ensure that initialization occurs shortly before the first time a
> command is executed.
> 
> The choice we have to make is between adding more state information to the
> SCSI core, plus complicated code for keeping that state information
> up-to-date, or making a small and easy-to-maintain change to the block
> layer core that does not involve any new state information. That's why
> I'm asking you to reconsider the patch at the start of this e-mail
> thread.

I don't like this addition, and honestly I wasn't a huge fan of adding
->init() hooks to the alloc path when we initially added this earlier
this summer. The fact that we now have to make this more invasive
doesn't improve the situation at all.

So fwiw, I too would much rather see an implementation based on an RQF_
flag instead.
Ming Lei Aug. 30, 2017, 3:01 a.m. UTC | #8
On Tue, Aug 29, 2017 at 08:15:23PM +0000, Bart Van Assche wrote:
> On Tue, 2017-08-29 at 19:16 +0800, Ming Lei wrote:
> > Hi Bart,
> > 
> > Did you see a perf regression on SRP with smaller job counts after applying
> > my patchset V3?
> > 
> > I just ran the test with 16 jobs (the system has 16 CPU cores) instead of 64,
> > and I see no perf regression on SRP for v4.13-rc6+blk-next (2nd column) vs.
> > v4.13-rc6+blk-next+patch V3 (3rd column):
> > 
> > 
> > ---------------------------------------
> >  IOPS(K)  |    NONE     |    NONE
> > ---------------------------------------
> > read      |      475.83 |      485.88
> > ---------------------------------------
> > randread  |      142.86 |      141.96
> > ---------------------------------------
> > write     |       483.9 |      492.39
> > ---------------------------------------
> > randwrite |      124.83 |      124.53
> > ---------------------------------------
> > [ ... ]
> > ---------------------------------------
> >  LAT(us)  |    NONE     |    NONE
> > ---------------------------------------
> > read      |        2.15 |        2.11
> > ---------------------------------------
> > randread  |        7.17 |        7.21
> > ---------------------------------------
> > write     |        2.11 |        2.08
> > ---------------------------------------
> > randwrite |         8.2 |        8.22
> > ---------------------------------------
> > [ ... ]
> 
> Hello Ming,
> 
> Although I would prefer to see measurement data against an SRP target system
> that supports a higher workload (>1M IOPS) and

Hi Bart,

For such a high workload, I guess it often requires increasing
.cmd_per_lun.

> also for a high-end NVMe drive,

My patch won't affect NVMe drives since the NVMe driver doesn't usually
become busy.


> I think the above data is sufficient to show that the performance impact of
> your patch series is most likely small enough even for high-end SCSI initiator
> drivers.

OK.

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 78a46794053e..463aa0ee36ce 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -126,6 +126,9 @@  void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
 	rq->part = NULL;
+
+	if (q && q->initialize_rq_fn)
+		q->initialize_rq_fn(rq);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
@@ -1420,21 +1423,12 @@  static struct request *blk_old_get_request(struct request_queue *q,
 struct request *blk_get_request(struct request_queue *q, unsigned int op,
 				gfp_t gfp_mask)
 {
-	struct request *req;
-
-	if (q->mq_ops) {
-		req = blk_mq_alloc_request(q, op,
+	if (q->mq_ops)
+		return blk_mq_alloc_request(q, op,
 			(gfp_mask & __GFP_DIRECT_RECLAIM) ?
 				0 : BLK_MQ_REQ_NOWAIT);
-		if (!IS_ERR(req) && q->mq_ops->initialize_rq_fn)
-			q->mq_ops->initialize_rq_fn(req);
-	} else {
-		req = blk_old_get_request(q, op, gfp_mask);
-		if (!IS_ERR(req) && q->initialize_rq_fn)
-			q->initialize_rq_fn(req);
-	}
-
-	return req;
+	else
+		return blk_old_get_request(q, op, gfp_mask);
 }
 EXPORT_SYMBOL(blk_get_request);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 21f2cff217ce..fdb33f8a2860 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -273,6 +273,7 @@  EXPORT_SYMBOL(blk_mq_can_queue);
 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		unsigned int tag, unsigned int op)
 {
+	struct request_queue *q = data->q;
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
 	struct request *rq = tags->static_rqs[tag];
 
@@ -325,6 +326,9 @@  static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->end_io_data = NULL;
 	rq->next_rq = NULL;
 
+	if (q->mq_ops->initialize_rq_fn)
+		q->mq_ops->initialize_rq_fn(rq);
+
 	data->ctx->rq_dispatched[op_is_sync(op)]++;
 	return rq;
 }
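
For context, a minimal sketch of how a blk-mq driver could hook into the
callback that the two hunks above now invoke for every request; all
exampledrv_* names and the pdu layout are assumptions for illustration only
(the intended consumer of this change is the SCSI core):

struct exampledrv_cmd {
        unsigned long jiffies_at_alloc;
        int retries;
};

/* Called from blk_mq_rq_ctx_init() for every request once this patch is applied. */
static void exampledrv_initialize_rq(struct request *rq)
{
        struct exampledrv_cmd *cmd = blk_mq_rq_to_pdu(rq);

        cmd->jiffies_at_alloc = jiffies;        /* must survive a later requeue */
        cmd->retries = 0;
}

static const struct blk_mq_ops exampledrv_mq_ops = {
        .initialize_rq_fn = exampledrv_initialize_rq,
        /* .queue_rq and the driver's other callbacks omitted for brevity */
};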