diff mbox series

scsi: core: Log SCSI command age with errors

Message ID 20190923060122.GA9603@machine1 (mailing list archive)
State Superseded
Headers show
Series scsi: core: Log SCSI command age with errors | expand

Commit Message

Milan P. Gandhi Sept. 23, 2019, 6:01 a.m. UTC
Couple of users had requested to print the SCSI command age along 
with command failure errors. This is a small change, but allows 
users to get more important information about the command that was 
failed, it would help the users in debugging the command failures:

Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com>
---

Comments

Laurence Oberman Sept. 23, 2019, 1:02 p.m. UTC | #1
On Mon, 2019-09-23 at 11:31 +0530, Milan P. Gandhi wrote:
> Couple of users had requested to print the SCSI command age along 
> with command failure errors. This is a small change, but allows 
> users to get more important information about the command that was 
> failed, it would help the users in debugging the command failures:
> 
> Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com>
> ---
> diff --git a/drivers/scsi/scsi_logging.c
> b/drivers/scsi/scsi_logging.c
> index ecc5918e372a..ca2182bc53c6 100644
> --- a/drivers/scsi/scsi_logging.c
> +++ b/drivers/scsi/scsi_logging.c
> @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd
> *cmd, const char *msg,
>  	const char *mlret_string = scsi_mlreturn_string(disposition);
>  	const char *hb_string = scsi_hostbyte_string(cmd->result);
>  	const char *db_string = scsi_driverbyte_string(cmd->result);
> +	unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;
>  
>  	logbuf = scsi_log_reserve_buffer(&logbuf_len);
>  	if (!logbuf)
> @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd
> *cmd, const char *msg,
>  
>  	if (db_string)
>  		off += scnprintf(logbuf + off, logbuf_len - off,
> -				 "driverbyte=%s", db_string);
> +				 "driverbyte=%s ", db_string);
>  	else
>  		off += scnprintf(logbuf + off, logbuf_len - off,
> -				 "driverbyte=0x%02x", driver_byte(cmd-
> >result));
> +				 "driverbyte=0x%02x ",
> +				 driver_byte(cmd->result));
> +
> +	off += scnprintf(logbuf + off, logbuf_len - off,
> +			 "cmd-age=%lus", cmd_age);
> +
>  out_printk:
>  	dev_printk(KERN_INFO, &cmd->device->sdev_gendev, "%s", logbuf);
>  	scsi_log_release_buffer(logbuf);
> 

This looks to be a useful debug addition to me, and the code looks
correct.
I believe this has also been tested by Milan in our lab.

Reviewed-by: Laurence Oberman <loberman@redhat.com>
Milan P. Gandhi Sept. 23, 2019, 1:24 p.m. UTC | #2
On 9/23/19 6:32 PM, Laurence Oberman wrote:
> On Mon, 2019-09-23 at 11:31 +0530, Milan P. Gandhi wrote:
>> Couple of users had requested to print the SCSI command age along 
>> with command failure errors. This is a small change, but allows 
>> users to get more important information about the command that was 
>> failed, it would help the users in debugging the command failures:
>>
>> Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com>
>> ---
>> diff --git a/drivers/scsi/scsi_logging.c
>> b/drivers/scsi/scsi_logging.c
>> index ecc5918e372a..ca2182bc53c6 100644
>> --- a/drivers/scsi/scsi_logging.c
>> +++ b/drivers/scsi/scsi_logging.c
>> @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd
>> *cmd, const char *msg,
>>  	const char *mlret_string = scsi_mlreturn_string(disposition);
>>  	const char *hb_string = scsi_hostbyte_string(cmd->result);
>>  	const char *db_string = scsi_driverbyte_string(cmd->result);
>> +	unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;
>>  
>>  	logbuf = scsi_log_reserve_buffer(&logbuf_len);
>>  	if (!logbuf)
>> @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd
>> *cmd, const char *msg,
>>  
>>  	if (db_string)
>>  		off += scnprintf(logbuf + off, logbuf_len - off,
>> -				 "driverbyte=%s", db_string);
>> +				 "driverbyte=%s ", db_string);
>>  	else
>>  		off += scnprintf(logbuf + off, logbuf_len - off,
>> -				 "driverbyte=0x%02x", driver_byte(cmd-
>>> result));
>> +				 "driverbyte=0x%02x ",
>> +				 driver_byte(cmd->result));
>> +
>> +	off += scnprintf(logbuf + off, logbuf_len - off,
>> +			 "cmd-age=%lus", cmd_age);
>> +
>>  out_printk:
>>  	dev_printk(KERN_INFO, &cmd->device->sdev_gendev, "%s", logbuf);
>>  	scsi_log_release_buffer(logbuf);
>>
> 
> This looks to be a useful debug addition to me, and the code looks
> correct.
> I believe this has also been tested by Milan in our lab.
> 
> Reviewed-by: Laurence Oberman <loberman@redhat.com> 
> 
Yes, the patch was tested locally using scsi_debug as well as in real 
storage issues caused by bad disks in customer environment.

Thanks,
Milan.
Bart Van Assche Sept. 25, 2019, 3:45 p.m. UTC | #3
On 9/22/19 11:01 PM, Milan P. Gandhi wrote:
> +	off += scnprintf(logbuf + off, logbuf_len - off,
> +			 "cmd-age=%lus", cmd_age);

Have you considered to change cmd-age into cmd_age? I'm afraid otherwise 
someone might interpret the hyphen as a subtraction sign...

Thanks,

Bart.
Milan P. Gandhi Sept. 26, 2019, 5:01 a.m. UTC | #4
On 9/25/19 9:15 PM, Bart Van Assche wrote:
> On 9/22/19 11:01 PM, Milan P. Gandhi wrote:
>> +    off += scnprintf(logbuf + off, logbuf_len - off,
>> +             "cmd-age=%lus", cmd_age);
> 
> Have you considered to change cmd-age into cmd_age? I'm afraid otherwise someone might interpret the hyphen as a subtraction sign...

Thanks for your suggestion Bart.
Yes, it would be better to use cmd_age to avoid any confusion.
I will change it and send the updated patch.

Thanks,
Milan.
Martin Wilck Sept. 27, 2019, 3:26 p.m. UTC | #5
On Mon, 2019-09-23 at 11:31 +0530,  Milan P. Gandhi wrote:
> Couple of users had requested to print the SCSI command age along 
> with command failure errors. This is a small change, but allows 
> users to get more important information about the command that was 
> failed, it would help the users in debugging the command failures:
> 
> Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com>
> ---
> diff --git a/drivers/scsi/scsi_logging.c
> b/drivers/scsi/scsi_logging.c
> index ecc5918e372a..ca2182bc53c6 100644
> --- a/drivers/scsi/scsi_logging.c
> +++ b/drivers/scsi/scsi_logging.c
> @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd
> *cmd, const char *msg,
>  	const char *mlret_string = scsi_mlreturn_string(disposition);
>  	const char *hb_string = scsi_hostbyte_string(cmd->result);
>  	const char *db_string = scsi_driverbyte_string(cmd->result);
> +	unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;

This comes down to pretty coarse time resolution, does it not? More
often than not, the time difference shown will be 0. I'd recommend at
least millisecond resolution.

>  
>  	logbuf = scsi_log_reserve_buffer(&logbuf_len);
>  	if (!logbuf)
> @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd
> *cmd, const char *msg,
>  
>  	if (db_string)
>  		off += scnprintf(logbuf + off, logbuf_len - off,
> -				 "driverbyte=%s", db_string);
> +				 "driverbyte=%s ", db_string);
>  	else
>  		off += scnprintf(logbuf + off, logbuf_len - off,
> -				 "driverbyte=0x%02x", driver_byte(cmd-
> >result));
> +				 "driverbyte=0x%02x ",
> +				 driver_byte(cmd->result));
> +
> +	off += scnprintf(logbuf + off, logbuf_len - off,
> +			 "cmd-age=%lus", cmd_age);

This is certainly helpful in some situations. Yet I am unsure if it
should *always* be printed. I wouldn't say it's as important as the 
other stuff scsi_print_result() provides. After all, by activating
MLQUEUE+MLCOMPLETE, the time on-wire can be extracted with better
accuracy from currently available logs. 

So perhaps make this depend on a module parameter?

Also, we should carefully ponder if we want to put this on the same
line as the driver byte, as users may have created scripts for parsing
this output.

Regards,
Martin
Laurence Oberman Sept. 27, 2019, 5:45 p.m. UTC | #6
On Fri, 2019-09-27 at 17:26 +0200, Martin Wilck wrote:
> On Mon, 2019-09-23 at 11:31 +0530,  Milan P. Gandhi wrote:
> > Couple of users had requested to print the SCSI command age along 
> > with command failure errors. This is a small change, but allows 
> > users to get more important information about the command that was 
> > failed, it would help the users in debugging the command failures:
> > 
> > Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com>
> > ---
> > diff --git a/drivers/scsi/scsi_logging.c
> > b/drivers/scsi/scsi_logging.c
> > index ecc5918e372a..ca2182bc53c6 100644
> > --- a/drivers/scsi/scsi_logging.c
> > +++ b/drivers/scsi/scsi_logging.c
> > @@ -437,6 +437,7 @@ void scsi_print_result(const struct scsi_cmnd
> > *cmd, const char *msg,
> >  	const char *mlret_string = scsi_mlreturn_string(disposition);
> >  	const char *hb_string = scsi_hostbyte_string(cmd->result);
> >  	const char *db_string = scsi_driverbyte_string(cmd->result);
> > +	unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;
> 
> This comes down to pretty coarse time resolution, does it not? More
> often than not, the time difference shown will be 0. I'd recommend at
> least millisecond resolution.
> 
> >  
> >  	logbuf = scsi_log_reserve_buffer(&logbuf_len);
> >  	if (!logbuf)
> > @@ -478,10 +479,15 @@ void scsi_print_result(const struct scsi_cmnd
> > *cmd, const char *msg,
> >  
> >  	if (db_string)
> >  		off += scnprintf(logbuf + off, logbuf_len - off,
> > -				 "driverbyte=%s", db_string);
> > +				 "driverbyte=%s ", db_string);
> >  	else
> >  		off += scnprintf(logbuf + off, logbuf_len - off,
> > -				 "driverbyte=0x%02x", driver_byte(cmd-
> > > result));
> > 
> > +				 "driverbyte=0x%02x ",
> > +				 driver_byte(cmd->result));
> > +
> > +	off += scnprintf(logbuf + off, logbuf_len - off,
> > +			 "cmd-age=%lus", cmd_age);
> 
> This is certainly helpful in some situations. Yet I am unsure if it
> should *always* be printed. I wouldn't say it's as important as the 
> other stuff scsi_print_result() provides. After all, by activating
> MLQUEUE+MLCOMPLETE, the time on-wire can be extracted with better
> accuracy from currently available logs. 
> 
> So perhaps make this depend on a module parameter?
> 
> Also, we should carefully ponder if we want to put this on the same
> line as the driver byte, as users may have created scripts for
> parsing
> this output.
> 
> Regards,
> Martin
> 
> 
Hi Martin

Agreed about log extraction, but turning that on with a busy workload
in a production environment is not practical. We cant do it with
systems with 1000's of luns and 1000's of IOPS/sec.
Also second resolution is good enough for the debug we want to see.

Regards
Laurence
Martin Wilck Sept. 30, 2019, 8:42 a.m. UTC | #7
On Fri, 2019-09-27 at 13:45 -0400, Laurence Oberman wrote:
> 
> Hi Martin
> 
> Agreed about log extraction, but turning that on with a busy workload
> in a production environment is not practical. We cant do it with
> systems with 1000's of luns and 1000's of IOPS/sec.
> Also second resolution is good enough for the debug we want to see.

I gather that you look at a specific problem where second resolution is
sufficient. For upstream, the generic usefulness should be considered,
and I don't think we can say today that better-than-second resolution
will never be useful, so I still vote for milliseconds.

Wrt the enablement of the option on highly loaded systems, I'm not sure
I understand. You need to enable SCSI logging anyway, don't you? Is it
an issue to have to set 2 sysfs values rather than just one?

Regards,
Martin

> 
> Regards
> Laurence
>
Milan P. Gandhi Sept. 30, 2019, 9:05 a.m. UTC | #8
On 9/30/19 2:12 PM, Martin Wilck wrote:
> On Fri, 2019-09-27 at 13:45 -0400, Laurence Oberman wrote:
>>
>> Hi Martin
>>
>> Agreed about log extraction, but turning that on with a busy workload
>> in a production environment is not practical. We cant do it with
>> systems with 1000's of luns and 1000's of IOPS/sec.
>> Also second resolution is good enough for the debug we want to see.
> 
> I gather that you look at a specific problem where second resolution is
> sufficient. For upstream, the generic usefulness should be considered,
> and I don't think we can say today that better-than-second resolution
> will never be useful, so I still vote for milliseconds.

Ok, I will change it to ms.

> Wrt the enablement of the option on highly loaded systems, I'm not sure
> I understand. You need to enable SCSI logging anyway, don't you?

By default we keep the SCSI debug logging disabled or am I missing 
something?

>Is it an issue to have to set 2 sysfs values rather than just one?

The idea here is to capture the above debug data even without 
any user interventions to change any sysfs entries or to enable 
debug logging on busy, critical production systems.

Also, we are not changing the existing text in SCSI command error log,
but we are only adding one single word at the end of message. Ideally
the user scripts are written to grep specific pattern from the logs.
Since we are not replacing any existing text from the logs, the 
scripts should still work with this change as well.

Thanks,
Milan.
Martin Wilck Sept. 30, 2019, 9:46 a.m. UTC | #9
Hello Milan,

On Mon, 2019-09-30 at 14:35 +0530, Milan P. Gandhi wrote:
> On 9/30/19 2:12 PM, Martin Wilck wrote:

> > Wrt the enablement of the option on highly loaded systems, I'm not
> > sure
> > I understand. You need to enable SCSI logging anyway, don't you?
> 
> By default we keep the SCSI debug logging disabled or am I missing 
> something?
> 
> > Is it an issue to have to set 2 sysfs values rather than just one?
> 
> The idea here is to capture the above debug data even without 
> any user interventions to change any sysfs entries or to enable 
> debug logging on busy, critical production systems.

So, you're looking at the scsi_io_completion() code path. In my
experience that isn't reliable for bug hunting because of the the
message rate limiting. Therefore I prefer using SCSI logging
MLCOMPLETE=1, where no rate limiting applies. But that's just a side
note, it depends on the case what's more useful.

Back to the cmd age output, IMO we're are on a thin line between
capturing useful information and keeping the logs neat. As I already
said, I'm not convinced that this information, as important it may be
for the case(s) you're currently investigating, has the same generic
degree of importance or usefulness as what's currently printed (the CDB
and the sense data). But OTOH, that's just a gut feeling, and I can't
claim to have the experience to make general statement on it. If noone
else has issues with this being printed by default, I'm not going
oppose it. 

> Also, we are not changing the existing text in SCSI command error
> log,
> but we are only adding one single word at the end of message. Ideally
> the user scripts are written to grep specific pattern from the logs.
> Since we are not replacing any existing text from the logs, the 
> scripts should still work with this change as well.

You are certainly aware that such scripts don't necessarily conform to
what kernel developers would consider "ideal" :-) But again, I just
wanted to raise the issue; if noone else thinks it matters, fine with
me.

Thanks
Martin
diff mbox series

Patch

diff --git a/drivers/scsi/scsi_logging.c b/drivers/scsi/scsi_logging.c
index ecc5918e372a..ca2182bc53c6 100644
--- a/drivers/scsi/scsi_logging.c
+++ b/drivers/scsi/scsi_logging.c
@@ -437,6 +437,7 @@  void scsi_print_result(const struct scsi_cmnd *cmd, const char *msg,
 	const char *mlret_string = scsi_mlreturn_string(disposition);
 	const char *hb_string = scsi_hostbyte_string(cmd->result);
 	const char *db_string = scsi_driverbyte_string(cmd->result);
+	unsigned long cmd_age = (jiffies - cmd->jiffies_at_alloc) / HZ;
 
 	logbuf = scsi_log_reserve_buffer(&logbuf_len);
 	if (!logbuf)
@@ -478,10 +479,15 @@  void scsi_print_result(const struct scsi_cmnd *cmd, const char *msg,
 
 	if (db_string)
 		off += scnprintf(logbuf + off, logbuf_len - off,
-				 "driverbyte=%s", db_string);
+				 "driverbyte=%s ", db_string);
 	else
 		off += scnprintf(logbuf + off, logbuf_len - off,
-				 "driverbyte=0x%02x", driver_byte(cmd->result));
+				 "driverbyte=0x%02x ",
+				 driver_byte(cmd->result));
+
+	off += scnprintf(logbuf + off, logbuf_len - off,
+			 "cmd-age=%lus", cmd_age);
+
 out_printk:
 	dev_printk(KERN_INFO, &cmd->device->sdev_gendev, "%s", logbuf);
 	scsi_log_release_buffer(logbuf);