diff mbox

[4/5] scsi_io_completion: conditional hints on fastpath

Message ID 20180307032134.18978-5-dgilbert@interlog.com (mailing list archive)
State Superseded
Headers show

Commit Message

Douglas Gilbert March 7, 2018, 3:21 a.m. UTC
ChangeLog:
  - add likely() and unlikely() hints to conditionals on or near the
    fastpath
  - fix a comment in scsi_io_completion_action()

Signed-off-by: Douglas Gilbert <dgilbert@interlog.com>
---
 drivers/scsi/scsi_lib.c | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

Comments

Johannes Thumshirn March 9, 2018, 9:13 a.m. UTC | #1
On Tue, Mar 06, 2018 at 10:21:33PM -0500, Douglas Gilbert wrote:
> ChangeLog:
>   - add likely() and unlikely() hints to conditionals on or near the
>     fastpath
>   - fix a comment in scsi_io_completion_action()

Do you have any numbers what this patch buys us performance wise?

Thanks,
	Johannes
Douglas Gilbert March 14, 2018, 10:39 a.m. UTC | #2
On 2018-03-09 10:13 AM, Johannes Thumshirn wrote:
> On Tue, Mar 06, 2018 at 10:21:33PM -0500, Douglas Gilbert wrote:
>> ChangeLog:
>>    - add likely() and unlikely() hints to conditionals on or near the
>>      fastpath
>>    - fix a comment in scsi_io_completion_action()
> 
> Do you have any numbers what this patch buys us performance wise?

How does 7 nanoseconds sound?

Measuring the time taken by scsi_io_completion() gives figures that
jump around seemingly randomly between 1.2 and 2.0 microseconds
(when the system is under heavy load). Turns out that almost all of
that is caused by the call to scsi_end_request() (and the functions
it calls). So the timing I was after was from the call into
scsi_io_completion() to the point on the fastpath just before
scsi_end_request() is called. ftrace function-graph doesn't even show
this time (perhaps it has a lower threshold value).

So I wrote my own timer, starting and stopping at those points, reading
the hr timer with ktime and only timing on one CPU (to avoid atomics
an mutexes). The input was a fio script used previously for testing
scsi_mq.

The "before" timing was on a kernel build with MKP's tree which now has
my "[PATCH v4] Make SCSI Status CONDITION MET equivalent to GOOD"
patch applied, but not this patchset. Also some of the time being measured
is from the measuring code itself :-)

Here is the "before", averaged over 100 millisecond windows:
   hrtimer: interrupt took 16045 ns
   print_timings: cum_time=100000156 ns, count=630299, avg=158 ns
   print_timings: cum_time=100000126 ns, count=623939, avg=160 ns
   print_timings: cum_time=100000020 ns, count=625046, avg=159 ns
   print_timings: cum_time=100000098 ns, count=624375, avg=160 ns
   print_timings: cum_time=100000060 ns, count=619438, avg=161 ns
   print_timings: cum_time=100000132 ns, count=616604, avg=162 ns
   print_timings: cum_time=100000050 ns, count=617839, avg=161 ns
   print_timings: cum_time=100000079 ns, count=623943, avg=160 ns
   print_timings: cum_time=100000128 ns, count=617545, avg=161 ns
   print_timings: cum_time=100000083 ns, count=617902, avg=161 ns

Here is the "after":
   print_timings: cum_time=100000022 ns, count=628977, avg=158 ns
   print_timings: cum_time=100000021 ns, count=642545, avg=155 ns
   print_timings: cum_time=100000042 ns, count=647620, avg=154 ns
   print_timings: cum_time=100000115 ns, count=649420, avg=153 ns
   print_timings: cum_time=100000001 ns, count=646527, avg=154 ns
   print_timings: cum_time=100000103 ns, count=648608, avg=154 ns
   print_timings: cum_time=100000074 ns, count=650341, avg=153 ns
   print_timings: cum_time=100000115 ns, count=652359, avg=153 ns
   print_timings: cum_time=100000031 ns, count=636437, avg=157 ns
   print_timings: cum_time=100000096 ns, count=632879, avg=158 ns


Attached is some ftrace function-graph output of scsi_io_completion()
which is gzipped for brevity.

Looked at in lines of C code, in the window I'm interested in the
"before" code does:
    - 4 local variable initializations
    - zeros two locals, sets false in another
    - wait_for = (cmd->allowed + 1) * req->timeout;  (local initialization)
    - 5 failed "if"s (one is "else if") with no hints
    - fails to trigger on BUG_ON
    - doesn't print a SCSI_LOG_HLCOMPLETE
    - before it gets to the fastpath exit:
         if (!(blk_rq_bytes(req) == 0 && error) &&
             !scsi_end_request(req, error, good_bytes, 0))
                 return;

The "after" code does:
    - 4 local variable initializations
    - 2 failed "if"s with hints
    - one failed "if" with hint in place of BUG_ON
    - doesn't print a SCSI_LOG_HLCOMPLETE
    - before it gets to the fastpath exit which has hints:
         if (likely(blk_rq_bytes(req) > 0 || blk_stat == BLK_STS_OK)) {
                 if (likely(!scsi_end_request(req, blk_stat, good_bytes, 0)))
                         return; /* no bytes remaining */
         }

So the difference measured on a Lenovo X270 (i5-7200U) amounts to
around 7 nanoseconds.

Doug Gilbert
Johannes Thumshirn March 14, 2018, 1:57 p.m. UTC | #3
On Wed, Mar 14, 2018 at 11:39:47AM +0100, Douglas Gilbert wrote:
> So the difference measured on a Lenovo X270 (i5-7200U) amounts to
> around 7 nanoseconds.
> 
> Doug Gilbert

Wow, thanks a lot.

So it's actually buying us something measurable (not all too much though).

Looks good,
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
diff mbox

Patch

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8d2d12a69508..23e59281bd02 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -818,7 +818,7 @@  static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result)
 	unsigned long wait_for = (cmd->allowed + 1) * req->timeout;
 	struct scsi_sense_hdr sshdr;
 	bool sense_valid_and_current = false;
-	blk_status_t blk_stat;		/* enum, BLK_STS_OK is 0 */
+	blk_status_t blk_stat;		/* u8, BLK_STS_OK is only 0 */
 
 	/* sense not about current command is termed: deferred */
 	if (scsi_command_normalize_sense(cmd, &sshdr) &&
@@ -1009,7 +1009,7 @@  void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 	struct request *req = cmd->request;
 	blk_status_t blk_stat = BLK_STS_OK;	/* u8: BLK_STS_OK is only 0 */
 
-	if (result) {
+	if (unlikely(result)) {
 		blk_stat = scsi_io_completion_nz_result(cmd, result);
 		if (blk_stat == BLK_STS_OK)
 			result = 0;
@@ -1017,14 +1017,14 @@  void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 			blk_stat = BLK_STS_OK;
 	}
 
-	if (blk_rq_is_passthrough(req)) {
+	if (unlikely(blk_rq_is_passthrough(req))) {
 		/*
 		 * __scsi_error_from_host_byte may have reset the host_byte
 		 */
 		scsi_req(req)->result = cmd->result;
 		scsi_req(req)->resid_len = scsi_get_resid(cmd);
 
-		if (scsi_bidi_cmnd(cmd)) {
+		if (unlikely(scsi_bidi_cmnd(cmd))) {
 			/*
 			 * Bidi commands Must be complete as a whole,
 			 * both sides at once.
@@ -1037,7 +1037,7 @@  void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 		}
 	}
 
-	/* no bidi support for !blk_rq_is_passthrough yet */
+	/* no bidi support yet, other than in pass-through */
 	BUG_ON(blk_bidi_rq(req));
 
 	SCSI_LOG_HLCOMPLETE(1, scmd_printk(KERN_INFO, cmd,
@@ -1049,15 +1049,13 @@  void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 	 * handle. Failed, zero length commands always need to drop down
 	 * to retry code. Fast path should return in this block.
 	 */
-	if (blk_rq_bytes(req) > 0 || blk_stat == BLK_STS_OK) {
-		if (!scsi_end_request(req, blk_stat, good_bytes, 0))
-			return; /* no bytes remaining */
+	if (likely(blk_rq_bytes(req) > 0 || blk_stat == BLK_STS_OK)) {
+		if (likely(!scsi_end_request(req, blk_stat, good_bytes, 0)))
+			return;	/* no bytes remaining */
 	}
 
-	/*
-	 * Kill remainder if no retrys.
-	 */
-	if (blk_stat && scsi_noretry_cmd(cmd)) {
+	/* Kill remainder if no retrys  */
+	if (unlikely(blk_stat && scsi_noretry_cmd(cmd))) {
 		if (scsi_end_request(req, blk_stat, blk_rq_bytes(req), 0))
 			BUG();
 		return;
@@ -1067,14 +1065,14 @@  void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
 	 * If there had been no error, but we have leftover bytes in the
 	 * requeues just queue the command up again.
 	 */
-	if (result == 0) {
+	if (likely(result == 0)) {
 		/*
 		 * Unprep the request and put it back at the head of the
 		 * queue. A new command will be prepared and issued.
 		 * This block is the same as case ACTION_REPREP in
 		 * scsi_io_completion_action() above.
 		 */
-		if (q->mq_ops)
+		if (likely(q->mq_ops))
 			scsi_mq_requeue_cmd(cmd);
 		else {
 			scsi_release_buffers(cmd);